Squashed 'PyTorch/Translation/GNMT/' changes from 4dc145a..51a90b1

51a90b1 Feb 14, 2019 update

git-subtree-dir: PyTorch/Translation/GNMT
git-subtree-split: 51a90b1667e7c1d45bd68bf719f8bf5d4c4521e3
Szymon Migacz 2019-02-14 12:40:30 +01:00
parent 8f95a78af2
commit 8efbea403f
35 changed files with 1939 additions and 1340 deletions

.gitignore vendored (1 line changed)

@ -4,3 +4,4 @@ tags
/results
/data
.DS_Store
.rsyncignore


@ -1,4 +1,4 @@
FROM nvcr.io/nvidia/pytorch:18.06-py3
FROM nvcr.io/nvidia/pytorch:19.01-py3
ENV LANG C.UTF-8
ENV LC_ALL C.UTF-8

README.md (240 lines changed)

@ -28,20 +28,37 @@ and
* 4-layer LSTM, hidden size 1024, first layer is bidirectional, the rest are
unidirectional
* with residual connections starting from 3rd layer
* uses LSTM layer accelerated by cuDNN
* uses standard pytorch nn.LSTM layer
* dropout is applied on input to all LSTM layers, probability of dropout is
set to 0.2
* hidden state of LSTM layers is initialized with zeros
* weights and biases of LSTM layers are initialized with the uniform(-0.1, 0.1)
distribution
* decoder:
* 4-layer unidirectional LSTM with hidden size 1024 and fully-connected
classifier
* with residual connections starting from 3rd layer
* uses LSTM layer accelerated by cuDNN
* uses standard pytorch nn.LSTM layer
* dropout is applied on input to all LSTM layers, probability of dropout is
set to 0.2
* hidden state of LSTM layers is initialized with zeros
* weights and biases of LSTM layers are initialized with the uniform(-0.1, 0.1)
distribution
* weights and biases of the fully-connected classifier are initialized with
the uniform(-0.1, 0.1) distribution
* attention:
* normalized Bahdanau attention
* output from first LSTM layer of decoder goes into attention,
then re-weighted context is concatenated with the input to all subsequent
LSTM layers of the decoder at the current timestep
* linear transform of keys and queries is initialized with uniform(-0.1, 0.1),
normalization scalar is initialized with 1.0 / sqrt(1024),
normalization bias is initialized with zero (see the attention sketch after
this list)
* inference:
* beam search with default beam size of 5
* with coverage penalty and length normalization terms
* with coverage penalty and length normalization, coverage penalty factor is
set to 0.1, length normalization factor is set to 0.6 and length
normalization constant is set to 5.0 (see the scoring sketch after this list)
* detokenized BLEU computed by [SacreBLEU](https://github.com/awslabs/sockeye/tree/master/contrib/sacrebleu)
* [motivation](https://github.com/awslabs/sockeye/tree/master/contrib/sacrebleu#motivation) for choosing SacreBLEU
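The attention bullets above correspond to a normalized additive (Bahdanau)
score. Below is a minimal sketch, assuming the formulation from the GNMT paper
and the initializations listed above; the parameter and function names are
illustrative and are not taken from this repository's `seq2seq` code.
```
import torch

hidden = 1024
# illustrative parameters, initialized as described in the bullets above
W_q = torch.empty(hidden, hidden).uniform_(-0.1, 0.1)  # linear transform of queries
W_k = torch.empty(hidden, hidden).uniform_(-0.1, 0.1)  # linear transform of keys
v = torch.empty(hidden).uniform_(-0.1, 0.1)            # attention vector
g = torch.tensor(1.0 / hidden ** 0.5)                  # normalization scalar
b = torch.zeros(hidden)                                # normalization bias

def attention_scores(query, keys):
    # normalized Bahdanau score: e_j = (g / ||v||) * v . tanh(W_q q + W_k k_j + b)
    v_hat = g * v / v.norm()
    return torch.tanh(query @ W_q.T + keys @ W_k.T + b) @ v_hat

def reweighted_context(query, keys, values):
    # query: output of the first decoder LSTM layer at the current timestep, (hidden,)
    # keys, values: encoder outputs, (src_len, hidden)
    weights = torch.softmax(attention_scores(query, keys), dim=0)
    return weights @ values
```
Scaling `v` by `g / ||v||` and adding the bias `b` is what distinguishes this
normalized variant from plain Bahdanau attention.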
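The coverage penalty and length normalization used to rank beam candidates can
be sketched in the same spirit. The formulas follow the GNMT paper with the
factors quoted above (0.1, 0.6 and 5.0); the function names are illustrative
and are not taken from `translate.py`.
```
import math

LEN_NORM_FACTOR = 0.6     # length normalization factor
LEN_NORM_CONST = 5.0      # length normalization constant
COV_PENALTY_FACTOR = 0.1  # coverage penalty factor

def length_penalty(tgt_len):
    # lp(Y) = ((c + |Y|) / (c + 1)) ** alpha
    return ((LEN_NORM_CONST + tgt_len) / (LEN_NORM_CONST + 1.0)) ** LEN_NORM_FACTOR

def coverage_penalty(attn):
    # attn: one list of attention weights over source tokens per decoded token
    # cp(X, Y) = beta * sum_i log(min(sum_j p_ij, 1.0))
    num_src = len(attn[0])
    coverage = [sum(step[i] for step in attn) for i in range(num_src)]
    return COV_PENALTY_FACTOR * sum(math.log(min(c, 1.0)) for c in coverage)

def candidate_score(log_prob, tgt_len, attn):
    # higher is better; used to rank finished beam candidates
    return log_prob / length_penalty(tgt_len) + coverage_penalty(attn)
```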
@ -53,12 +70,11 @@ Our experiments show that a 4-layer model is significantly faster to train and
yields comparable accuracy on the public
[WMT16 English-German](http://www.statmt.org/wmt16/translation-task.html)
dataset. The number of LSTM layers is controlled by the `num_layers` parameter
in the `scripts/train.sh` training script.
in the `train.py` training script.
# Setup
## Requirements
* [PyTorch 18.06-py3 NGC container](https://ngc.nvidia.com/registry/nvidia-pytorch)
(or newer)
* [PyTorch 19.01-py3 NGC container](https://ngc.nvidia.com/registry/nvidia-pytorch)
* [SacreBLEU 1.2.10](https://pypi.org/project/sacrebleu/1.2.10/)
This repository contains `Dockerfile` which extends the PyTorch NGC container
@ -76,7 +92,7 @@ and
Before you can train using mixed precision with Tensor Cores, ensure that you
have a
[NVIDIA Volta](https://www.nvidia.com/en-us/data-center/volta-gpu-architecture/)
based GPU.
based GPU. Other platforms may work but are not officially supported.
For information about how to train using mixed precision, see the
[Mixed Precision Training paper](https://arxiv.org/abs/1710.03740)
and
@ -109,15 +125,33 @@ By default, the training script will use all available GPUs. The training script
saves only one checkpoint with the lowest value of the loss function on the
validation dataset. All results and logs are saved to the `results` directory
(on the host) or to the `/workspace/gnmt/results` directory (in the container).
By default, the `scripts/train.sh` script will launch mixed precision training
By default, the `train.py` script will launch mixed precision training
with Tensor Cores. You can change this behaviour by setting the `--math fp32`
flag in the `scripts/train.sh` script.
flag for the `train.py` training script.
Launching training on 1, 4 or 8 GPUs:
```
bash scripts/train.sh
python3 -m launch train.py --seed 2 --train-global-batch-size 1024
```
Launching training on 16 GPUs:
```
python3 -m launch train.py --seed 2 --train-global-batch-size 2048
```
By default, the training script launches training with a batch size of 128 per
GPU. If the specified `--train-global-batch-size` is larger than 128 times the
number of GPUs available for training, then the training script accumulates
gradients over consecutive iterations and then performs the weight update.
For example, training on 1 GPU with `--train-global-batch-size 1024`
accumulates gradients over 8 iterations before performing the weight update
with the accumulated gradients.
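The number of accumulation iterations follows directly from this rule; a
minimal sketch of the arithmetic (variable names are illustrative, the actual
logic lives in `train.py`):
```
# illustrative arithmetic only; variable names are not taken from train.py
train_global_batch_size = 1024
batch_size_per_gpu = 128
num_gpus = 1

samples_per_iteration = batch_size_per_gpu * num_gpus
assert train_global_batch_size % samples_per_iteration == 0
iters_to_accumulate = train_global_batch_size // samples_per_iteration
print(iters_to_accumulate)  # 8: gradients from 8 iterations are summed per update
```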
The training script automatically runs the validation and testing after each
training epoch. The results from the validation and testing are printed to
the standard error (stderr) and saved to log files.
the standard output (stdout) and saved to log files.
The summary after each training epoch is printed in the following format:
```
@ -145,21 +179,24 @@ Our download script is very similar to the `wmt16_en_de.sh` script from the
[tensorflow/nmt](https://github.com/tensorflow/nmt/blob/master/nmt/scripts/wmt16_en_de.sh)
repository. Our download script contains an extra preprocessing step, which
discards all pairs of sentences which can't be decoded by *latin-1* encoder.
The `scripts/wmt16_en_de.sh` script uses the
[subword-nmt](https://github.com/rsennrich/subword-nmt)
package to segment text into subword units (BPE). By default, the script builds
the shared vocabulary of 32,000 tokens.
In order to test with other datasets, scripts need to be customized accordingly.
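As a rough guide for such customization, segmenting a different corpus with the
same package could look like the sketch below. It assumes the Python API of
`subword-nmt` (`learn_bpe` and `apply_bpe.BPE`) and made-up file names; the
`scripts/wmt16_en_de.sh` script drives the equivalent command-line tools
instead, so treat this only as an illustration.
```
import codecs

from subword_nmt.apply_bpe import BPE
from subword_nmt.learn_bpe import learn_bpe

# learn a joint BPE model with 32,000 merge symbols (file names are made up)
with codecs.open('train.tok.clean.en-de', encoding='utf-8') as corpus, \
     codecs.open('bpe.32000', 'w', encoding='utf-8') as codes:
    learn_bpe(corpus, codes, num_symbols=32000)

# apply the learned codes to tokenized text
with codecs.open('bpe.32000', encoding='utf-8') as codes:
    bpe = BPE(codes)
print(bpe.process_line('a tokenized sentence to segment'))
```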
## Running training
The default training configuration can be launched by running the
`scripts/train.sh` training script.
`train.py` training script.
By default, the training script saves only one checkpoint with the lowest value
of the loss function on the validation dataset; an evaluation is performed after
each training epoch. Results are stored in the `results/gnmt_wmt16` directory.
The training script launches data-parallel training with batch size 128 per GPU
on all available GPUs. After each training epoch, the script runs an evaluation
on all available GPUs. We have tested training on up to 16 GPUs on a single
node.
After each training epoch, the script runs an evaluation
on the validation dataset and outputs a BLEU score on the test dataset
(*newstest2014*). BLEU is computed by the
[SacreBLEU](https://github.com/awslabs/sockeye/tree/master/contrib/sacrebleu)
@ -171,12 +208,11 @@ behavior by setting the `CUDA_VISIBLE_DEVICES` variable in your environment or
by setting the `NV_GPU` variable at the Docker container launch
([see section "GPU isolation"](https://github.com/NVIDIA/nvidia-docker/wiki/nvidia-docker#gpu-isolation)).
By default, the `scripts/train.sh` script will launch mixed precision training
By default, the `train.py` script will launch mixed precision training
with Tensor Cores. You can change this behaviour by setting the `--math fp32`
flag in the `scripts/train.sh` script.
flag for the `train.py` script.
Internally, the `scripts/train.sh` script uses `train.py`. To view all available
options for training, run `python3 train.py --help`.
To view all available options for training, run `python3 train.py --help`.
## Running inference
Inference can be run by launching the `translate.py` inference script, although,
@ -188,93 +224,129 @@ normalization term. Greedy decoding can be enabled by setting the beam size to 1
To view all available options for inference, run `python3 translate.py --help`.
## Benchmarking scripts
### Training performance benchmark
The `scripts/benchmark_training.sh` benchmarking script runs a few, relatively
short training sessions and automatically collects performance numbers. The
benchmarking script assumes that the `scripts/wmt16_en_de.sh` data download
script was launched and the datasets are available in the default location
(`data` directory).
Results from the benchmark are stored in the `results` directory. After the
benchmark is done, you can launch the `scripts/parse_train_benchmark.sh` script
to generate a short summary which will contain launch configuration, performance
(in tokens per second), and estimated training time needed for one epoch (in
seconds).
### Inference performance and accuracy benchmark
The `scripts/benchmark_inference.sh` benchmarking script launches a number of
inference runs with different hyperparameters (beam size, batch size, arithmetic
type) on sorted and unsorted *newstest2014* test dataset. Performance and
accuracy results are stored in the `results/inference_benchmark` directory.
BLEU score is computed by the SacreBLEU package.
The `scripts/benchmark_inference.sh` script assumes that the
`scripts/wmt16_en_de.sh` data download script was
launched and the datasets are available in the default location (`data`
directory).
The `scripts/benchmark_inference.sh` script requires a pre-trained
model checkpoint. By default, the script is loading a checkpoint from the
`results/gnmt_wmt16/model_best.pth` location.
## Training Accuracy Results
All results were obtained by running the `scripts/train.sh` script in
the pytorch-18.06-py3 Docker container on NVIDIA DGX-1 with 8 V100 16G GPUs.
Results were obtained by running the `train.py` script with the default
batch size = 128 per GPU in the pytorch-19.01-py3 Docker container.
### NVIDIA DGX-1 (8x Tesla V100 16G)
Command used to launch the training:
| **number of GPUs** | **mixed precision BLEU** | **fp32 BLEU** | **mixed precision training time** | **fp32 training time** |
| ------------------ | ------------------------ | ------------- | --------------------------------- | ---------------------- |
| 1 | 22.54 | 22.25 | 412 minutes | 948 minutes |
| 4 | 22.45 | 22.46 | 118 minutes | 264 minutes |
| 8 | 22.41 | 22.43 | 64 minutes | 139 minutes |
```
python3 -m launch train.py --seed 2 --train-global-batch-size 1024
```
| **number of GPUs** | **batch size/GPU** | **mixed precision BLEU** | **fp32 BLEU** | **mixed precision training time** | **fp32 training time** |
| --- | --- | ----- | ----- | ------------- | ------------- |
| 1 | 128 | 24.59 | 24.71 | 264.4 minutes | 824.4 minutes |
| 4 | 128 | 24.30 | 24.45 | 89.5 minutes | 230.8 minutes |
| 8 | 128 | 24.45 | 24.48 | 46.2 minutes | 116.6 minutes |
### NVIDIA DGX-2 (16x Tesla V100 32G)
Commands used to launch the training:
```
for 1,4,8 GPUs:
python3 -m launch train.py --seed 2 --train-global-batch-size 1024
for 16 GPUs:
python3 -m launch train.py --seed 2 --train-global-batch-size 2048
```
| **number of GPUs** | **batch size/GPU** | **mixed precision BLEU** | **fp32 BLEU** | **mixed precision training time** | **fp32 training time** |
| --- | --- | ----- | ----- | ------------- | ------------- |
| 1 | 128 | 24.59 | 24.71 | 265.0 minutes | 825.1 minutes |
| 4 | 128 | 24.69 | 24.33 | 87.4 minutes | 216.3 minutes |
| 8 | 128 | 24.50 | 24.47 | 49.6 minutes | 113.5 minutes |
| 16 | 128 | 24.22 | 24.16 | 26.3 minutes | 58.6 minutes |
![TrainingLoss](./img/training_loss.png)
### Training Stability Test
The GNMT v2 model was trained for 10 epochs, starting from 96 different initial
The GNMT v2 model was trained for 6 epochs, starting from 50 different initial
random seeds. After each training epoch the model was evaluated on the test
dataset and the BLEU score was recorded. The training was performed in the
pytorch-18.06-py3 Docker container on NVIDIA DGX-1 with 8 V100 16G GPUs. The
following table summarizes results of the stability test.
pytorch-19.01-py3 Docker container on NVIDIA DGX-1 with 8 Tesla V100 16G GPUs.
The following table summarizes results of the stability test.
![TrainingAccuracy](./img/training_accuracy.png)
## Training Performance Results
All results were obtained by running the `scripts/train.sh` training script in
the pytorch-18.06-py3 Docker container on NVIDIA DGX-1 with 8 V100 16G GPUs.
Performance numbers (in tokens per second) were averaged over an entire training
epoch.
#### BLEU scores after each training epoch for different initial random seeds
| **epoch** | **average** | **stdev** | **minimum** | **maximum** | **median** |
| --- | ------ | ----- | ------ | ------ | ------ |
| 1 | 19.954 | 0.326 | 18.710 | 20.490 | 20.020 |
| 2 | 21.734 | 0.222 | 21.220 | 22.120 | 21.765 |
| 3 | 22.502 | 0.223 | 21.960 | 22.970 | 22.485 |
| 4 | 23.004 | 0.221 | 22.350 | 23.430 | 23.020 |
| 5 | 24.201 | 0.146 | 23.900 | 24.480 | 24.215 |
| 6 | 24.423 | 0.159 | 24.070 | 24.820 | 24.395 |
| **number of GPUs** | **mixed precision tokens/s** | **fp32 tokens/s** | **mixed precision speedup** | **mixed precision multi-gpu weak scaling** | **fp32 multi-gpu weak scaling** |
| -------- | ------------- | ------------- | ------------ | --------------------------- | --------------------------- |
| 1 | 42337 | 18581 | 2.279 | 1.000 | 1.000 |
| 4 | 153433 | 67586 | 2.270 | 3.624 | 3.637 |
| 8 | 300181 | 132734 | 2.262 | 7.090 | 7.144 |
## Training Performance Results
All results were obtained by running the `train.py` training script in the
pytorch-19.01-py3 Docker container. Performance numbers (in tokens per second)
were averaged over an entire training epoch.
### NVIDIA DGX-1 (8x Tesla V100 16G)
| **number of GPUs** | **batch size/GPU** | **mixed precision tokens/s** | **fp32 tokens/s** | **mixed precision speedup** | **mixed precision multi-gpu strong scaling** | **fp32 multi-gpu strong scaling** |
| --- | --- | ------ | ------ | ----- | ----- | ----- |
| 1 | 128 | 66050 | 21346 | 3.094 | 1.000 | 1.000|
| 4 | 128 | 196174 | 76083 | 2.578 | 2.970 | 3.564|
| 8 | 128 | 387282 | 153697 | 2.520 | 5.863 | 7.200|
### NVIDIA DGX-2 (16x Tesla V100 32G)
| **number of GPUs** | **batch size/GPU** | **mixed precision tokens/s** | **fp32 tokens/s** | **mixed precision speedup** | **mixed precision multi-gpu strong scaling** | **fp32 multi-gpu strong scaling** |
| --- | --- | ------ | ------- | ----- | ------ | ------ |
| 1 | 128 | 65830 | 22695 | 2.901 | 1.000 | 1.000 |
| 4 | 128 | 200886 | 81224 | 2.473 | 3.052 | 3.579 |
| 8 | 128 | 362612 | 156536 | 2.316 | 5.508 | 6.897 |
| 16 | 128 | 738521 | 314831 | 2.346 | 11.219 | 13.872 |
## Inference Performance Results
All results were obtained by running the `scripts/benchmark_inference.sh`
benchmarking script in the pytorch-18.06-py3 Docker container on NVIDIA DGX-1.
Inference was run on a single V100 16G GPU.
All results were obtained by running the `translate.py` script in the
pytorch-19.01-py3 Docker container on NVIDIA DGX-1. Inference benchmark was run
on a single Tesla V100 16G GPU. The benchmark requires a checkpoint from a fully
trained model.
Command to launch the inference benchmark:
```
python3 translate.py --input data/wmt16_de_en/newstest2014.tok.bpe.32000.en \
--reference data/wmt16_de_en/newstest2014.de --output /tmp/output \
--model results/gnmt/model_best.pth --batch-size 32 128 512 \
--beam-size 1 2 5 10 --math fp16 fp32
```
| **batch size** | **beam size** | **mixed precision BLEU** | **fp32 BLEU** | **mixed precision tokens/s** | **fp32 tokens/s** |
| -------------- | ------------- | ------------- | ------------- | ----------------- | ------------ |
| 512 | 1 | 20.63 | 20.63 | 62009 | 31229 |
| 512 | 2 | 21.55 | 21.60 | 32669 | 16454 |
| 512 | 5 | 22.34 | 22.36 | 21105 | 8562 |
| 512 | 10 | 22.34 | 22.40 | 12967 | 4720 |
| 128 | 1 | 20.62 | 20.63 | 27095 | 19505 |
| 128 | 2 | 21.56 | 21.60 | 13224 | 9718 |
| 128 | 5 | 22.38 | 22.36 | 10987 | 6575 |
| 128 | 10 | 22.35 | 22.40 | 8603 | 4103 |
| 32 | 1 | 20.62 | 20.63 | 9451 | 8483 |
| 32 | 2 | 21.56 | 21.60 | 4818 | 4333 |
| 32 | 5 | 22.34 | 22.36 | 4505 | 3655 |
| 32 | 10 | 22.37 | 22.40 | 4086 | 2822 |
| ---- | ----- | ------- | ------- | ---------|-------- |
| 32 | 1 | 23.18 | 23.18 | 23571 | 19462 |
| 32 | 2 | 24.09 | 24.12 | 15303 | 12345 |
| 32 | 5 | 24.63 | 24.62 | 13644 | 7725 |
| 32 | 10 | 24.50 | 24.48 | 11049 | 5359 |
| 128 | 1 | 23.17 | 23.18 | 73429 | 42272 |
| 128 | 2 | 24.07 | 24.12 | 43373 | 23131 |
| 128 | 5 | 24.69 | 24.63 | 29646 | 12525 |
| 128 | 10 | 24.45 | 24.48 | 19100 | 6886 |
| 512 | 1 | 23.17 | 23.18 | 135333 | 48962 |
| 512 | 2 | 24.08 | 24.12 | 74367 | 27308 |
| 512 | 5 | 24.60 | 24.63 | 39217 | 12674 |
| 512 | 10 | 24.54 | 24.48 | 21433 | 6640 |
# Changelog
1. Aug 7, 2018
* Initial release
2. Dec 4, 2018
* Added exponential warm-up and step learning rate decay
* Multi-GPU (distributed) inference and validation
* Default container updated to PyTorch 18.11-py3
* General performance improvements
3. Feb 14, 2019
* Different batching algorithm (bucketing with 5 equal-width buckets)
* Additional dropouts before first LSTM layer in encoder and in decoder
* Weight initialization changed to uniform (-0.1, 0.1)
* Switched order of dropout and concatenation with attention in decoder
* Default container updated to PyTorch 19.01-py3
# Known issues
None

(two binary image files changed, not shown; sizes 121 KiB -> 115 KiB and 205 KiB -> 295 KiB)

launch.py (new file, 235 lines)

@ -0,0 +1,235 @@
r"""
`torch.distributed.launch` is a module that spawns up multiple distributed
training processes on each of the training nodes.
The utility can be used for single-node distributed training, in which one or
more processes per node will be spawned. The utility can be used for either
CPU training or GPU training. If the utility is used for GPU training,
each distributed process will be operating on a single GPU. This can achieve
well-improved single-node training performance. It can also be used in
multi-node distributed training, by spawning up multiple processes on each node
for well-improved multi-node distributed training performance as well.
This will especially be beneficial for systems with multiple InfiniBand
interfaces that have direct-GPU support, since all of them can be utilized for
aggregated communication bandwidth.
In both cases of single-node distributed training or multi-node distributed
training, this utility will launch the given number of processes per node
(``--nproc_per_node``). If used for GPU training, this number needs to be less
than or equal to the number of GPUs on the current system (``nproc_per_node``),
and each process will be operating on a single GPU from *GPU 0 to
GPU (nproc_per_node - 1)*.
**How to use this module:**
1. Single-Node multi-process distributed training
::
>>> python -m torch.distributed.launch --nproc_per_node=NUM_GPUS_YOU_HAVE
YOUR_TRAINING_SCRIPT.py (--arg1 --arg2 --arg3 and all other
arguments of your training script)
2. Multi-Node multi-process distributed training: (e.g. two nodes)
Node 1: *(IP: 192.168.1.1, and has a free port: 1234)*
::
>>> python -m torch.distributed.launch --nproc_per_node=NUM_GPUS_YOU_HAVE
--nnodes=2 --node_rank=0 --master_addr="192.168.1.1"
--master_port=1234 YOUR_TRAINING_SCRIPT.py (--arg1 --arg2 --arg3
and all other arguments of your training script)
Node 2:
::
>>> python -m torch.distributed.launch --nproc_per_node=NUM_GPUS_YOU_HAVE
--nnodes=2 --node_rank=1 --master_addr="192.168.1.1"
--master_port=1234 YOUR_TRAINING_SCRIPT.py (--arg1 --arg2 --arg3
and all other arguments of your training script)
3. To look up what optional arguments this module offers:
::
>>> python -m torch.distributed.launch --help
**Important Notices:**
1. This utility and multi-process distributed (single-node or
multi-node) GPU training currently only achieves the best performance using
the NCCL distributed backend. Thus NCCL backend is the recommended backend to
use for GPU training.
2. In your training program, you must parse the command-line argument:
``--local_rank=LOCAL_PROCESS_RANK``, which will be provided by this module.
If your training program uses GPUs, you should ensure that your code only
runs on the GPU device of LOCAL_PROCESS_RANK. This can be done by:
Parsing the local_rank argument
::
>>> import argparse
>>> parser = argparse.ArgumentParser()
>>> parser.add_argument("--local_rank", type=int)
>>> args = parser.parse_args()
Set your device to local rank using either
::
>>> torch.cuda.set_device(args.local_rank) # before your code runs
or
::
>>> with torch.cuda.device(args.local_rank):
>>> # your code to run
3. In your training program, you are supposed to call the following function
at the beginning to start the distributed backend. You need to make sure that
the init_method uses ``env://``, which is the only supported ``init_method``
by this module.
::
torch.distributed.init_process_group(backend='YOUR BACKEND',
init_method='env://')
4. In your training program, you can either use regular distributed functions
or use :func:`torch.nn.parallel.DistributedDataParallel` module. If your
training program uses GPUs for training and you would like to use
:func:`torch.nn.parallel.DistributedDataParallel` module,
here is how to configure it.
::
model = torch.nn.parallel.DistributedDataParallel(model,
device_ids=[args.local_rank],
output_device=args.local_rank)
Please ensure that ``device_ids`` argument is set to be the only GPU device id
that your code will be operating on. This is generally the local rank of the
process. In other words, the ``device_ids`` needs to be ``[args.local_rank]``,
and ``output_device`` needs to be ``args.local_rank`` in order to use this
utility.
.. warning::
``local_rank`` is NOT globally unique: it is only unique per process
on a machine. Thus, don't use it to decide if you should, e.g.,
write to a networked filesystem. See
https://github.com/pytorch/pytorch/issues/12042 for an example of
how things can go wrong if you don't do this correctly.
"""
import sys
import subprocess
import os
import socket
from argparse import ArgumentParser, REMAINDER
import torch
def parse_args():
"""
Helper function parsing the command line options
@retval ArgumentParser
"""
parser = ArgumentParser(description="PyTorch distributed training launch "
"helper utilty that will spawn up "
"multiple distributed processes")
# Optional arguments for the launch helper
parser.add_argument("--nnodes", type=int, default=1,
help="The number of nodes to use for distributed "
"training")
parser.add_argument("--node_rank", type=int, default=0,
help="The rank of the node for multi-node distributed "
"training")
parser.add_argument("--nproc_per_node", type=int, default=None,
help="The number of processes to launch on each node, "
"for GPU training, this is recommended to be set "
"to the number of GPUs in your system so that "
"each process can be bound to a single GPU.")
parser.add_argument("--master_addr", default="127.0.0.1", type=str,
help="Master node (rank 0)'s address, should be either "
"the IP address or the hostname of node 0, for "
"single node multi-proc training, the "
"--master_addr can simply be 127.0.0.1")
parser.add_argument("--master_port", default=29500, type=int,
help="Master node (rank 0)'s free port that needs to "
"be used for communciation during distributed "
"training")
# positional
parser.add_argument("training_script", type=str,
help="The full path to the single GPU training "
"program/script to be launched in parallel, "
"followed by all the arguments for the "
"training script")
# rest from the training program
parser.add_argument('training_script_args', nargs=REMAINDER)
return parser.parse_args()
def main():
args = parse_args()
if args.nproc_per_node is None:
args.nproc_per_node = torch.cuda.device_count()
# world size in terms of number of processes
dist_world_size = args.nproc_per_node * args.nnodes
# set PyTorch distributed related environmental variables
current_env = os.environ.copy()
current_env["MASTER_ADDR"] = args.master_addr
current_env["MASTER_PORT"] = str(args.master_port)
current_env["WORLD_SIZE"] = str(dist_world_size)
processes = []
for local_rank in range(0, args.nproc_per_node):
# each process's rank
dist_rank = args.nproc_per_node * args.node_rank + local_rank
current_env["RANK"] = str(dist_rank)
# spawn the processes
cmd = [sys.executable,
"-u",
args.training_script,
"--local_rank={}".format(local_rank)] + args.training_script_args
process = subprocess.Popen(cmd, env=current_env)
processes.append(process)
returncode = 0
try:
for process in processes:
process_returncode = process.wait()
if process_returncode != 0:
returncode = 1
except KeyboardInterrupt:
print('CTRL-C, TERMINATING WORKERS ...')
for process in processes:
process.terminate()
for process in processes:
process.wait()
raise
sys.exit(returncode)
if __name__ == "__main__":
main()
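To make the contract described in the docstring concrete, a minimal,
hypothetical training script that this launcher could start might look as
follows; the model and the omitted training loop are placeholders, not code
from this repository.
```
import argparse

import torch
import torch.distributed as dist

def main():
    parser = argparse.ArgumentParser()
    # --local_rank is appended by the launcher for every spawned process
    parser.add_argument('--local_rank', type=int, default=0)
    args = parser.parse_args()

    # bind this process to its GPU and join the process group through the
    # env:// variables (MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE) set by the launcher
    torch.cuda.set_device(args.local_rank)
    dist.init_process_group(backend='nccl', init_method='env://')

    model = torch.nn.Linear(1024, 1024).cuda()  # placeholder model
    model = torch.nn.parallel.DistributedDataParallel(
        model, device_ids=[args.local_rank], output_device=args.local_rank)
    # ... data loading and the training loop would go here ...

if __name__ == '__main__':
    main()
```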


@ -1,46 +0,0 @@
import sys
import subprocess
import torch
def main():
argslist = list(sys.argv)[1:]
world_size = torch.cuda.device_count()
if '--world-size' in argslist:
argslist[argslist.index('--world-size') + 1] = str(world_size)
else:
argslist.append('--world-size')
argslist.append(str(world_size))
workers = []
for i in range(world_size):
if '--rank' in argslist:
argslist[argslist.index('--rank') + 1] = str(i)
else:
argslist.append('--rank')
argslist.append(str(i))
stdout = None if i == 0 else subprocess.DEVNULL
worker = subprocess.Popen([str(sys.executable)] + argslist, stdout=stdout)
workers.append(worker)
returncode = 0
try:
for worker in workers:
worker_returncode = worker.wait()
if worker_returncode != 0:
returncode = 1
except KeyboardInterrupt:
print('Pressed CTRL-C, TERMINATING')
for worker in workers:
worker.terminate()
for worker in workers:
worker.wait()
raise
sys.exit(returncode)
if __name__ == "__main__":
main()


@ -1 +1,2 @@
sacrebleu==1.2.10
git+git://github.com/NVIDIA/apex.git#egg=apex


@ -1,101 +0,0 @@
#!/bin/bash
set -e
DATASET_DIR='data/wmt16_de_en'
RESULTS_DIR='gnmt_wmt16'
# sort by length (ascending)
cat ${DATASET_DIR}/newstest2014.tok.bpe.32000.en \
| awk '{ print length, $0 }' \
| sort -n -s \
| cut -d" " -f2- > /tmp/newstest2014.tok.bpe.32000.en.sorted
batches=(512 256 128 64 32)
beams=(1 2 5 10)
maths=(fp16 fp32)
model=results/${RESULTS_DIR}/model_best.pth
odir=results/inference_benchmark
mkdir -p $odir
echo RUNNING on unsorted dataset
rm -rf $odir/fp16_perf_unsorted.log
rm -rf $odir/fp32_perf_unsorted.log
rm -rf $odir/fp16_bleu.log
rm -rf $odir/fp32_bleu.log
ifile=${DATASET_DIR}/newstest2014.tok.bpe.32000.en
rfile=${DATASET_DIR}/newstest2014.de
for math in "${maths[@]}"
do
for batch in "${batches[@]}"
do
for beam in "${beams[@]}"
do
echo RUNNING: batch_size: $batch, beam_size: $beam, math: $math
# run translation
python3 translate.py \
-i $ifile \
-r $rfile \
-m $model \
--math $math \
--print-freq 1 \
--beam-size $beam \
--batch-size $batch \
-o /tmp/output.tok &> /tmp/log.log
tok_per_sec=`cat /tmp/log.log \
|grep "Avg total tokens" \
|cut -f 2 \
|cut -d ':' -f 2 |tr -d ' '`
bleu=`cat /tmp/log.log \
|grep BLEU \
|cut -d ':' -f 2 |tr -d ' '`
echo -e $tok_per_sec '\t\t batch: '$batch 'beam: ' $beam >> $odir/${math}_perf_unsorted.log
echo -e $bleu '\t\t batch: '$batch 'beam: ' $beam >> $odir/${math}_bleu.log
echo Tokens per second: $tok_per_sec, BLEU: $bleu
done
done
done
echo RUNNING on sorted dataset
rm -rf $odir/fp16_perf_sorted.log
rm -rf $odir/fp32_perf_sorted.log
ifile=/tmp/newstest2014.tok.bpe.32000.en.sorted
for math in "${maths[@]}"
do
for batch in "${batches[@]}"
do
for beam in "${beams[@]}"
do
echo RUNNING: batch_size: $batch, beam_size: $beam, math: $math
# run translation
python3 translate.py \
-i $ifile \
-m $model \
--math $math \
--print-freq 1 \
--beam-size $beam \
--batch-size $batch \
--no-bleu \
-o /tmp/output.tok &> /tmp/log.log
tok_per_sec=`cat /tmp/log.log \
|grep "Avg total tokens" \
|cut -f 2 \
|cut -d ':' -f 2 |tr -d ' '`
echo -e $tok_per_sec '\t\t batch: '$batch 'beam: ' $beam >> $odir/${math}_perf_sorted.log
echo Tokens per second: $tok_per_sec
done
done
done


@ -1,28 +0,0 @@
#!/bin/bash
DATASET_DIR='data/wmt16_de_en'
batches=(128)
maths=(fp16 fp32)
gpus=(1 2 4 8)
for math in "${maths[@]}"
do
for batch in "${batches[@]}"
do
for gpu in "${gpus[@]}"
do
export CUDA_VISIBLE_DEVICES=`seq -s "," 0 $((gpu - 1))`
python3 -m multiproc train.py \
--save benchmark_gpu_${gpu}_math_${math}_batch_${batch} \
--dataset-dir ${DATASET_DIR} \
--seed 1 \
--epochs 1 \
--math ${math} \
--print-freq 1 \
--batch-size ${batch} \
--disable-eval \
--max-size $((512 * ${batch} * ${gpu}))
done
done
done


@ -1,6 +1,7 @@
import argparse
from collections import Counter
def parse_args():
parser = argparse.ArgumentParser(description='Clean dataset')
parser.add_argument('-f1', '--file1', help='file1')
@ -12,6 +13,7 @@ def save_output(fname, data):
with open(fname, 'w') as f:
f.writelines(data)
def main():
"""
Discards all pairs of sentences which can't be decoded by latin-1 encoder.


@ -1,46 +0,0 @@
#!/bin/bash
batches=(128)
maths=(fp16 fp32)
gpus=(1 2 4 8)
sentences=3498161
echo -e [parameters] "\t\t\t" [tokens / s] [second per epoch]
for batch in "${batches[@]}"
do
for math in "${maths[@]}"
do
for gpu in "${gpus[@]}"
do
dir=results/benchmark_gpu_${gpu}_math_${math}_batch_${batch}/
if [ ! -d $dir ]; then
echo Directory $dir does not exist
continue
fi
total_tokens_per_s=0
for gpu_id in `seq 0 $((gpu - 1))`
do
tokens_per_s=`cat ${dir}/log_gpu_${gpu_id}.log \
|grep TRAIN \
|cut -f 4 \
|sed -E -n 's/.*\(([0-9]+)\).*/\1/p' \
|tail -n 1`
total_tokens_per_s=$((total_tokens_per_s + tokens_per_s))
done
batch_time=`cat ${dir}/log_gpu_0.log \
|grep TRAIN \
|cut -f 2 \
|sed -E -n 's/.*\(([.0-9]+)\).*/\1/p' \
|tail -n 1`
n_batches=$(( $sentences / ($batch * $gpu)))
epoch_time=`awk "BEGIN {print $n_batches * $batch_time}"`
echo -e math: $math batch: $batch gpus: $gpu "\t\t" $total_tokens_per_s "\t" $epoch_time
done
done
done


@ -1,6 +1,14 @@
fp16,1,Tesla V100-SXM2-16GB,42337
fp16,4,Tesla V100-SXM2-16GB,153433
fp16,8,Tesla V100-SXM2-16GB,300181
fp32,1,Tesla V100-SXM2-16GB,18581
fp32,4,Tesla V100-SXM2-16GB,67586
fp32,8,Tesla V100-SXM2-16GB,132734
fp16,1,Tesla V100-SXM2-16GB,66050
fp16,4,Tesla V100-SXM2-16GB,196174
fp16,8,Tesla V100-SXM2-16GB,387282
fp32,1,Tesla V100-SXM2-16GB,21346
fp32,4,Tesla V100-SXM2-16GB,76083
fp32,8,Tesla V100-SXM2-16GB,153697
fp16,1,Tesla V100-SXM3-32GB,65830
fp16,4,Tesla V100-SXM3-32GB,200886
fp16,8,Tesla V100-SXM3-32GB,362612
fp16,16,Tesla V100-SXM3-32GB,738521
fp32,1,Tesla V100-SXM3-32GB,22695
fp32,4,Tesla V100-SXM3-32GB,81224
fp32,8,Tesla V100-SXM3-32GB,156536
fp32,16,Tesla V100-SXM3-32GB,314831


@ -3,81 +3,35 @@
set -e
DATASET_DIR='data/wmt16_de_en'
RESULTS_DIR='gnmt_wmt16_test'
REFERENCE_FILE=scripts/tests/reference_performance
LOGFILE=results/${RESULTS_DIR}/log_gpu_0.log
REPO_DIR='/workspace/gnmt'
REFERENCE_FILE=$REPO_DIR/scripts/tests/reference_performance
REFERENCE_ACCURACY=17.2
MATH='fp16'
PERFORMANCE_TOLERANCE=0.9
python3 -m multiproc train.py \
--save ${RESULTS_DIR} \
--dataset-dir ${DATASET_DIR} \
--seed 1 \
--epochs 1 \
--math ${MATH} \
--print-freq 10 \
--batch-size 128 \
--model-config "{'num_layers': 4, 'hidden_size': 1024, 'dropout':0.2, 'share_embedding': True}" \
--optimization-config "{'optimizer': 'Adam', 'lr': 5e-4}"
PERF_TOLERANCE=0.9
GPU_NAME=`nvidia-smi --query-gpu=gpu_name --format=csv,noheader |uniq`
echo 'GPU_NAME:' ${GPU_NAME}
GPU_COUNT=`nvidia-smi --query-gpu=gpu_name --format=csv,noheader |wc -l`
echo 'GPU_COUNT:' ${GPU_COUNT}
# Accuracy test
ACHIEVED_ACCURACY=`cat ${LOGFILE} \
|grep Summary \
|tail -n 1 \
|cut -f 4 \
|egrep -o [0-9.]+`
REFERENCE_PERF=`grep "${MATH},${GPU_COUNT},${GPU_NAME}" \
${REFERENCE_FILE} | \cut -f 4 -d ','`
echo 'REFERENCE_ACCURACY:' ${REFERENCE_ACCURACY}
echo 'ACHIEVED_ACCURACY:' ${ACHIEVED_ACCURACY}
ACCURACY_TEST_RESULT=$(awk 'BEGIN {print ('${ACHIEVED_ACCURACY}' >= '${REFERENCE_ACCURACY}')}')
if (( ${ACCURACY_TEST_RESULT} )); then
echo "&&&& ACCURACY TEST PASSED"
else
echo "&&&& ACCURACY TEST FAILED"
fi
# Performance test
ACHIEVED_PERFORMANCE=`cat ${LOGFILE} \
|grep Performance \
|tail -n 1 \
|cut -f 2 \
|egrep -o [0-9.]+`
REFERENCE_PERFORMANCE=`grep "${MATH},${GPU_COUNT},${GPU_NAME}" ${REFERENCE_FILE} \
| \cut -f 4 -d ','`
echo 'REFERENCE_PERFORMANCE:' ${REFERENCE_PERFORMANCE}
echo 'ACHIEVED_PERFORMANCE:' ${ACHIEVED_PERFORMANCE}
PERFORMANCE_TEST_RESULT=1
if [ -z "${REFERENCE_PERFORMANCE}" ]; then
if [ -z "${REFERENCE_PERF}" ]; then
echo "WARNING: COULD NOT FIND REFERENCE PERFORMANCE FOR EXECUTED CONFIG"
echo "&&&& PERFORMANCE TEST WAIVED"
TARGET_PERF=''
else
PERFORMANCE_TEST_RESULT=$(awk 'BEGIN {print ('${ACHIEVED_PERFORMANCE}' >= \
('${REFERENCE_PERFORMANCE}' * '${PERFORMANCE_TOLERANCE}'))}')
if (( ${PERFORMANCE_TEST_RESULT} )); then
echo "&&&& PERFORMANCE TEST PASSED"
else
echo "&&&& PERFORMANCE TEST FAILED"
fi
PERF_THRESHOLD=$(awk 'BEGIN {print ('${REFERENCE_PERF}' * '${PERF_TOLERANCE}')}')
TARGET_PERF='--target-perf '${PERF_THRESHOLD}
fi
if (( ${ACCURACY_TEST_RESULT} )) && (( ${PERFORMANCE_TEST_RESULT} )); then
echo "&&&& PASSED"
exit 0
else
echo "&&&& FAILED"
exit 1
fi
cd $REPO_DIR
python3 -m launch train.py \
--dataset-dir $DATASET_DIR \
--seed 1 \
--epochs 1 \
--remain-steps 1.0 \
--target-bleu 17.20 \
--math ${MATH} \
${TARGET_PERF}


@ -3,81 +3,35 @@
set -e
DATASET_DIR='data/wmt16_de_en'
RESULTS_DIR='gnmt_wmt16_test'
REFERENCE_FILE=scripts/tests/reference_performance
LOGFILE=results/${RESULTS_DIR}/log_gpu_0.log
REPO_DIR='/workspace/gnmt'
REFERENCE_FILE=$REPO_DIR/scripts/tests/reference_performance
REFERENCE_ACCURACY=17.2
MATH='fp32'
PERFORMANCE_TOLERANCE=0.9
python3 -m multiproc train.py \
--save ${RESULTS_DIR} \
--dataset-dir ${DATASET_DIR} \
--seed 1 \
--epochs 1 \
--math ${MATH} \
--print-freq 10 \
--batch-size 128 \
--model-config "{'num_layers': 4, 'hidden_size': 1024, 'dropout':0.2, 'share_embedding': True}" \
--optimization-config "{'optimizer': 'Adam', 'lr': 5e-4}"
PERF_TOLERANCE=0.9
GPU_NAME=`nvidia-smi --query-gpu=gpu_name --format=csv,noheader |uniq`
echo 'GPU_NAME:' ${GPU_NAME}
GPU_COUNT=`nvidia-smi --query-gpu=gpu_name --format=csv,noheader |wc -l`
echo 'GPU_COUNT:' ${GPU_COUNT}
# Accuracy test
ACHIEVED_ACCURACY=`cat ${LOGFILE} \
|grep Summary \
|tail -n 1 \
|cut -f 4 \
|egrep -o [0-9.]+`
REFERENCE_PERF=`grep "${MATH},${GPU_COUNT},${GPU_NAME}" \
${REFERENCE_FILE} | \cut -f 4 -d ','`
echo 'REFERENCE_ACCURACY:' ${REFERENCE_ACCURACY}
echo 'ACHIEVED_ACCURACY:' ${ACHIEVED_ACCURACY}
ACCURACY_TEST_RESULT=$(awk 'BEGIN {print ('${ACHIEVED_ACCURACY}' >= '${REFERENCE_ACCURACY}')}')
if (( ${ACCURACY_TEST_RESULT} )); then
echo "&&&& ACCURACY TEST PASSED"
else
echo "&&&& ACCURACY TEST FAILED"
fi
# Performance test
ACHIEVED_PERFORMANCE=`cat ${LOGFILE} \
|grep Performance \
|tail -n 1 \
|cut -f 2 \
|egrep -o [0-9.]+`
REFERENCE_PERFORMANCE=`grep "${MATH},${GPU_COUNT},${GPU_NAME}" ${REFERENCE_FILE} \
| \cut -f 4 -d ','`
echo 'REFERENCE_PERFORMANCE:' ${REFERENCE_PERFORMANCE}
echo 'ACHIEVED_PERFORMANCE:' ${ACHIEVED_PERFORMANCE}
PERFORMANCE_TEST_RESULT=1
if [ -z "${REFERENCE_PERFORMANCE}" ]; then
if [ -z "${REFERENCE_PERF}" ]; then
echo "WARNING: COULD NOT FIND REFERENCE PERFORMANCE FOR EXECUTED CONFIG"
echo "&&&& PERFORMANCE TEST WAIVED"
TARGET_PERF=''
else
PERFORMANCE_TEST_RESULT=$(awk 'BEGIN {print ('${ACHIEVED_PERFORMANCE}' >= \
('${REFERENCE_PERFORMANCE}' * '${PERFORMANCE_TOLERANCE}'))}')
if (( ${PERFORMANCE_TEST_RESULT} )); then
echo "&&&& PERFORMANCE TEST PASSED"
else
echo "&&&& PERFORMANCE TEST FAILED"
fi
PERF_THRESHOLD=$(awk 'BEGIN {print ('${REFERENCE_PERF}' * '${PERF_TOLERANCE}')}')
TARGET_PERF='--target-perf '${PERF_THRESHOLD}
fi
if (( ${ACCURACY_TEST_RESULT} )) && (( ${PERFORMANCE_TEST_RESULT} )); then
echo "&&&& PASSED"
exit 0
else
echo "&&&& FAILED"
exit 1
fi
cd $REPO_DIR
python3 -m launch train.py \
--dataset-dir $DATASET_DIR \
--seed 1 \
--epochs 1 \
--remain-steps 1.0 \
--target-bleu 17.20 \
--math ${MATH} \
${TARGET_PERF}


@ -3,81 +3,34 @@
set -e
DATASET_DIR='data/wmt16_de_en'
RESULTS_DIR='gnmt_wmt16_test'
REFERENCE_FILE=scripts/tests/reference_performance
LOGFILE=results/${RESULTS_DIR}/log_gpu_0.log
REPO_DIR='/workspace/gnmt'
REFERENCE_FILE=$REPO_DIR/scripts/tests/reference_performance
REFERENCE_ACCURACY=22.0
MATH='fp16'
PERFORMANCE_TOLERANCE=0.9
python3 -m multiproc train.py \
--save ${RESULTS_DIR} \
--dataset-dir ${DATASET_DIR} \
--seed 1 \
--epochs 6 \
--math ${MATH} \
--print-freq 10 \
--batch-size 128 \
--model-config "{'num_layers': 4, 'hidden_size': 1024, 'dropout':0.2, 'share_embedding': True}" \
--optimization-config "{'optimizer': 'Adam', 'lr': 5e-4}"
PERF_TOLERANCE=0.9
GPU_NAME=`nvidia-smi --query-gpu=gpu_name --format=csv,noheader |uniq`
echo 'GPU_NAME:' ${GPU_NAME}
GPU_COUNT=`nvidia-smi --query-gpu=gpu_name --format=csv,noheader |wc -l`
echo 'GPU_COUNT:' ${GPU_COUNT}
# Accuracy test
ACHIEVED_ACCURACY=`cat ${LOGFILE} \
|grep Summary \
|tail -n 1 \
|cut -f 4 \
|egrep -o [0-9.]+`
REFERENCE_PERF=`grep "${MATH},${GPU_COUNT},${GPU_NAME}" \
${REFERENCE_FILE} | \cut -f 4 -d ','`
echo 'REFERENCE_ACCURACY:' ${REFERENCE_ACCURACY}
echo 'ACHIEVED_ACCURACY:' ${ACHIEVED_ACCURACY}
ACCURACY_TEST_RESULT=$(awk 'BEGIN {print ('${ACHIEVED_ACCURACY}' >= '${REFERENCE_ACCURACY}')}')
if (( ${ACCURACY_TEST_RESULT} )); then
echo "&&&& ACCURACY TEST PASSED"
else
echo "&&&& ACCURACY TEST FAILED"
fi
# Performance test
ACHIEVED_PERFORMANCE=`cat ${LOGFILE} \
|grep Performance \
|tail -n 1 \
|cut -f 2 \
|egrep -o [0-9.]+`
REFERENCE_PERFORMANCE=`grep "${MATH},${GPU_COUNT},${GPU_NAME}" ${REFERENCE_FILE} \
| \cut -f 4 -d ','`
echo 'REFERENCE_PERFORMANCE:' ${REFERENCE_PERFORMANCE}
echo 'ACHIEVED_PERFORMANCE:' ${ACHIEVED_PERFORMANCE}
PERFORMANCE_TEST_RESULT=1
if [ -z "${REFERENCE_PERFORMANCE}" ]; then
if [ -z "${REFERENCE_PERF}" ]; then
echo "WARNING: COULD NOT FIND REFERENCE PERFORMANCE FOR EXECUTED CONFIG"
echo "&&&& PERFORMANCE TEST WAIVED"
TARGET_PERF=''
else
PERFORMANCE_TEST_RESULT=$(awk 'BEGIN {print ('${ACHIEVED_PERFORMANCE}' >= \
('${REFERENCE_PERFORMANCE}' * '${PERFORMANCE_TOLERANCE}'))}')
if (( ${PERFORMANCE_TEST_RESULT} )); then
echo "&&&& PERFORMANCE TEST PASSED"
else
echo "&&&& PERFORMANCE TEST FAILED"
fi
PERF_THRESHOLD=$(awk 'BEGIN {print ('${REFERENCE_PERF}' * '${PERF_TOLERANCE}')}')
TARGET_PERF='--target-perf '${PERF_THRESHOLD}
fi
if (( ${ACCURACY_TEST_RESULT} )) && (( ${PERFORMANCE_TEST_RESULT} )); then
echo "&&&& PASSED"
exit 0
else
echo "&&&& FAILED"
exit 1
fi
cd $REPO_DIR
python3 -m launch train.py \
--dataset-dir $DATASET_DIR \
--seed 1 \
--epochs 6 \
--target-bleu 22.00 \
--math ${MATH} \
${TARGET_PERF}


@ -3,81 +3,34 @@
set -e
DATASET_DIR='data/wmt16_de_en'
RESULTS_DIR='gnmt_wmt16_test'
REFERENCE_FILE=scripts/tests/reference_performance
LOGFILE=results/${RESULTS_DIR}/log_gpu_0.log
REPO_DIR='/workspace/gnmt'
REFERENCE_FILE=$REPO_DIR/scripts/tests/reference_performance
REFERENCE_ACCURACY=22.0
MATH='fp32'
PERFORMANCE_TOLERANCE=0.9
python3 -m multiproc train.py \
--save ${RESULTS_DIR} \
--dataset-dir ${DATASET_DIR} \
--seed 1 \
--epochs 6 \
--math ${MATH} \
--print-freq 10 \
--batch-size 128 \
--model-config "{'num_layers': 4, 'hidden_size': 1024, 'dropout':0.2, 'share_embedding': True}" \
--optimization-config "{'optimizer': 'Adam', 'lr': 5e-4}"
PERF_TOLERANCE=0.9
GPU_NAME=`nvidia-smi --query-gpu=gpu_name --format=csv,noheader |uniq`
echo 'GPU_NAME:' ${GPU_NAME}
GPU_COUNT=`nvidia-smi --query-gpu=gpu_name --format=csv,noheader |wc -l`
echo 'GPU_COUNT:' ${GPU_COUNT}
# Accuracy test
ACHIEVED_ACCURACY=`cat ${LOGFILE} \
|grep Summary \
|tail -n 1 \
|cut -f 4 \
|egrep -o [0-9.]+`
REFERENCE_PERF=`grep "${MATH},${GPU_COUNT},${GPU_NAME}" \
${REFERENCE_FILE} | \cut -f 4 -d ','`
echo 'REFERENCE_ACCURACY:' ${REFERENCE_ACCURACY}
echo 'ACHIEVED_ACCURACY:' ${ACHIEVED_ACCURACY}
ACCURACY_TEST_RESULT=$(awk 'BEGIN {print ('${ACHIEVED_ACCURACY}' >= '${REFERENCE_ACCURACY}')}')
if (( ${ACCURACY_TEST_RESULT} )); then
echo "&&&& ACCURACY TEST PASSED"
else
echo "&&&& ACCURACY TEST FAILED"
fi
# Performance test
ACHIEVED_PERFORMANCE=`cat ${LOGFILE} \
|grep Performance \
|tail -n 1 \
|cut -f 2 \
|egrep -o [0-9.]+`
REFERENCE_PERFORMANCE=`grep "${MATH},${GPU_COUNT},${GPU_NAME}" ${REFERENCE_FILE} \
| \cut -f 4 -d ','`
echo 'REFERENCE_PERFORMANCE:' ${REFERENCE_PERFORMANCE}
echo 'ACHIEVED_PERFORMANCE:' ${ACHIEVED_PERFORMANCE}
PERFORMANCE_TEST_RESULT=1
if [ -z "${REFERENCE_PERFORMANCE}" ]; then
if [ -z "${REFERENCE_PERF}" ]; then
echo "WARNING: COULD NOT FIND REFERENCE PERFORMANCE FOR EXECUTED CONFIG"
echo "&&&& PERFORMANCE TEST WAIVED"
TARGET_PERF=''
else
PERFORMANCE_TEST_RESULT=$(awk 'BEGIN {print ('${ACHIEVED_PERFORMANCE}' >= \
('${REFERENCE_PERFORMANCE}' * '${PERFORMANCE_TOLERANCE}'))}')
if (( ${PERFORMANCE_TEST_RESULT} )); then
echo "&&&& PERFORMANCE TEST PASSED"
else
echo "&&&& PERFORMANCE TEST FAILED"
fi
PERF_THRESHOLD=$(awk 'BEGIN {print ('${REFERENCE_PERF}' * '${PERF_TOLERANCE}')}')
TARGET_PERF='--target-perf '${PERF_THRESHOLD}
fi
if (( ${ACCURACY_TEST_RESULT} )) && (( ${PERFORMANCE_TEST_RESULT} )); then
echo "&&&& PASSED"
exit 0
else
echo "&&&& FAILED"
exit 1
fi
cd $REPO_DIR
python3 -m launch train.py \
--dataset-dir $DATASET_DIR \
--seed 1 \
--epochs 6 \
--target-bleu 22.00 \
--math ${MATH} \
${TARGET_PERF}


@ -2,17 +2,4 @@
set -e
DATASET_DIR='data/wmt16_de_en'
RESULTS_DIR='gnmt_wmt16'
# run training
python3 -m multiproc train.py \
--save ${RESULTS_DIR} \
--dataset-dir ${DATASET_DIR} \
--seed 1 \
--epochs 6 \
--math fp16 \
--print-freq 10 \
--batch-size 128 \
--model-config "{'num_layers': 4, 'hidden_size': 1024, 'dropout':0.2, 'share_embedding': True}" \
--optimization-config "{'optimizer': 'Adam', 'lr': 5e-4}"
python3 -m launch train.py


@ -1,20 +1,23 @@
import logging
from operator import itemgetter
import torch
from torch.utils.data import Dataset
from torch.utils.data.sampler import SequentialSampler
from torch.utils.data import DataLoader
from torch.utils.data import Dataset
import seq2seq.data.config as config
from seq2seq.data.sampler import BucketingSampler
from seq2seq.data.sampler import DistributedSampler
from seq2seq.data.sampler import ShardingSampler
from seq2seq.data.sampler import StaticDistributedSampler
def build_collate_fn(batch_first=False, parallel=True, sort=False):
"""
Factor for collate_fn functions.
Factory for collate_fn functions.
:param batch_first: if True returns batches in (batch, seq) format, if not
returns in (seq, batch) format
:param batch_first: if True returns batches in (batch, seq) format, if
False returns in (seq, batch) format
:param parallel: if True builds batches from parallel corpus (src, tgt)
:param sort: if True sorts by src sequence length within each batch
"""
@ -49,8 +52,8 @@ def build_collate_fn(batch_first=False, parallel=True, sort=False):
"""
src_seqs, tgt_seqs = zip(*seqs)
if sort:
key = lambda item: len(item[1])
indices, src_seqs = zip(*sorted(enumerate(src_seqs), key=key,
indices, src_seqs = zip(*sorted(enumerate(src_seqs),
key=lambda item: len(item[1]),
reverse=True))
tgt_seqs = [tgt_seqs[idx] for idx in indices]
@ -64,8 +67,8 @@ def build_collate_fn(batch_first=False, parallel=True, sort=False):
:param src_seqs: source sequences
"""
if sort:
key = lambda item: len(item[1])
indices, src_seqs = zip(*sorted(enumerate(src_seqs), key=key,
indices, src_seqs = zip(*sorted(enumerate(src_seqs),
key=lambda item: len(item[1]),
reverse=True))
else:
indices = range(len(src_seqs))
@ -81,10 +84,22 @@ def build_collate_fn(batch_first=False, parallel=True, sort=False):
class TextDataset(Dataset):
def __init__(self, src_fname, tokenizer, min_len=None, max_len=None,
sort=False, max_size=None):
"""
Constructor for the TextDataset. Builds monolingual dataset.
:param src_fname: path to the file with data
:param tokenizer: tokenizer
:param min_len: minimum sequence length
:param max_len: maximum sequence length
:param sort: sorts dataset by sequence length
:param max_size: loads at most 'max_size' samples from the input file,
if None loads the entire dataset
"""
self.min_len = min_len
self.max_len = max_len
self.parallel = False
self.sorted = False
self.src = self.process_data(src_fname, tokenizer, max_size)
@ -98,11 +113,35 @@ class TextDataset(Dataset):
self.sort_by_length()
def sort_by_length(self):
"""
Sorts dataset by the sequence length.
"""
self.lengths, indices = self.lengths.sort(descending=True)
self.src = [self.src[idx] for idx in indices]
self.indices = indices.tolist()
self.sorted = True
def unsort(self, array):
"""
"Unsorts" given array (restores original order of elements before
dataset was sorted by sequence length).
:param array: array to be "unsorted"
"""
if self.sorted:
inverse = sorted(enumerate(self.indices), key=itemgetter(1))
array = [array[i[0]] for i in inverse]
return array
def filter_data(self, min_len, max_len):
"""
Preserves only samples which satisfy the following inequality:
min_len <= sample sequence length <= max_len
:param min_len: minimum sequence length
:param max_len: maximum sequence length
"""
logging.info(f'Filtering data, min len: {min_len}, max len: {max_len}')
initial_len = len(self.src)
@ -116,6 +155,14 @@ class TextDataset(Dataset):
logging.info(f'Pairs before: {initial_len}, after: {filtered_len}')
def process_data(self, fname, tokenizer, max_size):
"""
Loads data from the input file.
:param fname: input file name
:param tokenizer: tokenizer
:param max_size: loads at most 'max_size' samples from the input file,
if None loads the entire dataset
"""
logging.info(f'Processing data from {fname}')
data = []
with open(fname) as dfile:
@ -133,33 +180,57 @@ class TextDataset(Dataset):
def __getitem__(self, idx):
return self.src[idx]
def get_loader(self, batch_size=1, shuffle=False, num_workers=0,
batch_first=False, drop_last=False, bucketing=True):
def get_loader(self, batch_size=1, seeds=None, shuffle=False,
num_workers=0, batch_first=False, pad=False,
batching=None, batching_opt={}):
collate_fn = build_collate_fn(batch_first, parallel=self.parallel,
sort=True)
if shuffle:
sampler = BucketingSampler(self, batch_size, bucketing)
if batching == 'random':
sampler = DistributedSampler(self, batch_size, seeds)
elif batching == 'sharding':
sampler = ShardingSampler(self, batch_size, seeds,
batching_opt['shard_size'])
elif batching == 'bucketing':
sampler = BucketingSampler(self, batch_size, seeds,
batching_opt['num_buckets'])
else:
raise NotImplementedError
else:
sampler = SequentialSampler(self)
sampler = StaticDistributedSampler(self, batch_size, pad)
return DataLoader(self,
batch_size=batch_size,
collate_fn=collate_fn,
sampler=sampler,
num_workers=num_workers,
pin_memory=False,
drop_last=drop_last)
pin_memory=True,
drop_last=False)
class ParallelDataset(TextDataset):
def __init__(self, src_fname, tgt_fname, tokenizer,
min_len, max_len, sort=False, max_size=None):
"""
Constructor for the ParallelDataset.
Tokenization is done when the data is loaded from the disk.
:param src_fname: path to the file with src language data
:param tgt_fname: path to the file with tgt language data
:param tokenizer: tokenizer
:param min_len: minimum sequence length
:param max_len: maximum sequence length
:param sort: sorts dataset by sequence length
:param max_size: loads at most 'max_size' samples from the input file,
if None loads the entire dataset
"""
self.min_len = min_len
self.max_len = max_len
self.parallel = True
self.sorted = False
self.src = self.process_data(src_fname, tokenizer, max_size)
self.tgt = self.process_data(tgt_fname, tokenizer, max_size)
@ -168,19 +239,37 @@ class ParallelDataset(TextDataset):
self.filter_data(min_len, max_len)
assert len(self.src) == len(self.tgt)
lengths = [len(s) + len(t) for (s, t) in zip(self.src, self.tgt)]
self.lengths = torch.tensor(lengths)
src_lengths = [len(s) for s in self.src]
tgt_lengths = [len(t) for t in self.tgt]
self.src_lengths = torch.tensor(src_lengths)
self.tgt_lengths = torch.tensor(tgt_lengths)
self.lengths = self.src_lengths + self.tgt_lengths
if sort:
self.sort_by_length()
def sort_by_length(self):
"""
Sorts dataset by the sequence length.
"""
self.lengths, indices = self.lengths.sort(descending=True)
self.src = [self.src[idx] for idx in indices]
self.tgt = [self.tgt[idx] for idx in indices]
self.src_lengths = [self.src_lengths[idx] for idx in indices]
self.tgt_lengths = [self.tgt_lengths[idx] for idx in indices]
self.indices = indices.tolist()
self.sorted = True
def filter_data(self, min_len, max_len):
"""
Preserves only samples which satisfy the following inequality:
min_len <= src sample sequence length <= max_len AND
min_len <= tgt sample sequence length <= max_len
:param min_len: minimum sequence length
:param max_len: maximum sequence length
"""
logging.info(f'Filtering data, min len: {min_len}, max len: {max_len}')
initial_len = len(self.src)
@ -199,3 +288,98 @@ class ParallelDataset(TextDataset):
def __getitem__(self, idx):
return self.src[idx], self.tgt[idx]
class LazyParallelDataset(TextDataset):
def __init__(self, src_fname, tgt_fname, tokenizer,
min_len, max_len, sort=False, max_size=None):
"""
Constructor for the LazyParallelDataset.
Tokenization is done on the fly.
:param src_fname: path to the file with src language data
:param tgt_fname: path to the file with tgt language data
:param tokenizer: tokenizer
:param min_len: minimum sequence length
:param max_len: maximum sequence length
:param sort: sorts dataset by sequence length
:param max_size: loads at most 'max_size' samples from the input file,
if None loads the entire dataset
"""
self.min_len = min_len
self.max_len = max_len
self.parallel = True
self.sorted = False
self.tokenizer = tokenizer
self.raw_src = self.process_raw_data(src_fname, max_size)
self.raw_tgt = self.process_raw_data(tgt_fname, max_size)
assert len(self.raw_src) == len(self.raw_tgt)
logging.info(f'Filtering data, min len: {min_len}, max len: {max_len}')
# Subtracting 2 because EOS and BOS are added later during tokenization
self.filter_raw_data(min_len - 2, max_len - 2)
assert len(self.raw_src) == len(self.raw_tgt)
# Adding 2 because EOS and BOS are added later during tokenization
src_lengths = [i + 2 for i in self.src_len]
tgt_lengths = [i + 2 for i in self.tgt_len]
self.src_lengths = torch.tensor(src_lengths)
self.tgt_lengths = torch.tensor(tgt_lengths)
self.lengths = self.src_lengths + self.tgt_lengths
def process_raw_data(self, fname, max_size):
"""
Loads data from the input file.
:param fname: input file name
:param max_size: loads at most 'max_size' samples from the input file,
if None loads the entire dataset
"""
logging.info(f'Processing data from {fname}')
data = []
with open(fname) as dfile:
for idx, line in enumerate(dfile):
if max_size and idx == max_size:
break
data.append(line)
return data
def filter_raw_data(self, min_len, max_len):
"""
Preserves only samples which satisfy the following inequality:
min_len <= src sample sequence length <= max_len AND
min_len <= tgt sample sequence length <= max_len
:param min_len: minimum sequence length
:param max_len: maximum sequence length
"""
initial_len = len(self.raw_src)
filtered_src = []
filtered_tgt = []
filtered_src_len = []
filtered_tgt_len = []
for src, tgt in zip(self.raw_src, self.raw_tgt):
src_len = src.count(' ') + 1
tgt_len = tgt.count(' ') + 1
if min_len <= src_len <= max_len and \
min_len <= tgt_len <= max_len:
filtered_src.append(src)
filtered_tgt.append(tgt)
filtered_src_len.append(src_len)
filtered_tgt_len.append(tgt_len)
self.raw_src = filtered_src
self.raw_tgt = filtered_tgt
self.src_len = filtered_src_len
self.tgt_len = filtered_tgt_len
filtered_len = len(self.raw_src)
logging.info(f'Pairs before: {initial_len}, after: {filtered_len}')
def __getitem__(self, idx):
src = torch.tensor(self.tokenizer.segment(self.raw_src[idx]))
tgt = torch.tensor(self.tokenizer.segment(self.raw_tgt[idx]))
return src, tgt
def __len__(self):
return len(self.raw_src)
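A hypothetical usage sketch for the new loader interface, based only on the
signatures visible in this diff: the import path, file names and the toy
tokenizer are assumptions (the real tokenizer is provided elsewhere in the
repository).
```
# assumption: this module is importable as seq2seq.data.dataset
from seq2seq.data.dataset import LazyParallelDataset

class ToyTokenizer:
    # stand-in exposing the segment() method that __getitem__ calls
    def segment(self, line):
        return [hash(token) % 32000 for token in line.split()]

dataset = LazyParallelDataset('train.tok.bpe.32000.en',  # made-up file names
                              'train.tok.bpe.32000.de',
                              tokenizer=ToyTokenizer(),
                              min_len=0,
                              max_len=50)

loader = dataset.get_loader(batch_size=128,
                            seeds=list(range(6)),        # one seed per epoch
                            shuffle=True,
                            batching='bucketing',        # or 'random' / 'sharding'
                            batching_opt={'num_buckets': 5},
                            batch_first=False)
```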


@ -1,23 +1,22 @@
import logging
import torch
from torch.utils.data.sampler import Sampler
from seq2seq.utils import get_world_size, get_rank
from seq2seq.utils import get_rank
from seq2seq.utils import get_world_size
class BucketingSampler(Sampler):
"""
Distributed data sampler supporting bucketing by sequence length.
"""
def __init__(self, dataset, batch_size, bucketing=True, world_size=None,
rank=None):
class DistributedSampler(Sampler):
def __init__(self, dataset, batch_size, seeds, world_size=None, rank=None):
"""
Constructor for the BucketingSampler.
Constructor for the DistributedSampler.
:param dataset: dataset
:param batch_size: batch size
:param bucketing: if True enables bucketing by sequence length
:param world_size: number of processes participating in distributed
training
:param rank: rank of the current process within world_size
:param batch_size: local batch size
:param seeds: list of seeds, one seed for each training epoch
:param world_size: number of distributed workers
:param rank: rank of the current process
"""
if world_size is None:
world_size = get_world_size()
@ -28,75 +27,251 @@ class BucketingSampler(Sampler):
self.world_size = world_size
self.rank = rank
self.epoch = 0
self.bucketing = bucketing
self.seeds = seeds
self.batch_size = batch_size
self.global_batch_size = batch_size * world_size
self.data_len = len(self.dataset)
self.num_samples = self.data_len // self.global_batch_size \
* self.global_batch_size
def __iter__(self):
# deterministically shuffle based on epoch
g = torch.Generator()
g.manual_seed(self.epoch)
def init_rng(self):
"""
Creates new RNG, seed depends on current epoch idx.
"""
rng = torch.Generator()
seed = self.seeds[self.epoch]
logging.info(f'Sampler for epoch {self.epoch} uses seed {seed}')
rng.manual_seed(seed)
return rng
# generate permutation
indices = torch.randperm(self.data_len, generator=g)
# make indices evenly divisible by (batch_size * world_size)
indices = indices[:self.num_samples]
# splits the dataset into chunks of 'batches_in_shard' global batches
# each, sorts by (src + tgt) sequence length within each chunk,
# reshuffles all global batches
if self.bucketing:
batches_in_shard = 80
shard_size = self.global_batch_size * batches_in_shard
nshards = (self.num_samples + shard_size - 1) // shard_size
lengths = self.dataset.lengths[indices]
shards = [indices[i * shard_size:(i+1) * shard_size] for i in range(nshards)]
len_shards = [lengths[i * shard_size:(i+1) * shard_size] for i in range(nshards)]
indices = []
for len_shard in len_shards:
_, ind = len_shard.sort()
indices.append(ind)
output = tuple(shard[idx] for shard, idx in zip(shards, indices))
indices = torch.cat(output)
# global reshuffle
indices = indices.view(-1, self.global_batch_size)
order = torch.randperm(indices.shape[0], generator=g)
indices = indices[order, :]
indices = indices.view(-1)
def distribute_batches(self, indices):
"""
Assigns batches to workers.
Consecutive ranks are getting consecutive batches.
:param indices: torch.tensor with batch indices
"""
assert len(indices) == self.num_samples
# build indices for each individual worker
# consecutive ranks are getting consecutive batches,
# default pytorch DistributedSampler assigns strided batches
# with offset = length / world_size
indices = indices.view(-1, self.batch_size)
indices = indices[self.rank::self.world_size].contiguous()
indices = indices.view(-1)
indices = indices.tolist()
assert len(indices) == self.num_samples // self.world_size
return indices
def reshuffle_batches(self, indices, rng):
"""
Permutes global batches
:param indices: torch.tensor with batch indices
:param rng: instance of torch.Generator
"""
indices = indices.view(-1, self.global_batch_size)
num_batches = indices.shape[0]
order = torch.randperm(num_batches, generator=rng)
indices = indices[order, :]
indices = indices.view(-1)
return indices
def __iter__(self):
rng = self.init_rng()
# generate permutation
indices = torch.randperm(self.data_len, generator=rng)
# make indices evenly divisible by (batch_size * world_size)
indices = indices[:self.num_samples]
# assign batches to workers
indices = self.distribute_batches(indices)
return iter(indices)
def __len__(self):
return self.num_samples // self.world_size
def set_epoch(self, epoch):
"""
Sets current epoch index. This value is used to seed RNGs in __iter__()
function.
Sets current epoch index.
Epoch index is used to seed RNG in __iter__() function.
:param epoch: index of current epoch
"""
self.epoch = epoch
def __len__(self):
return self.num_samples // self.world_size
class ShardingSampler(DistributedSampler):
def __init__(self, dataset, batch_size, seeds, shard_size,
world_size=None, rank=None):
"""
Constructor for the ShardingSampler.
:param dataset: dataset
:param batch_size: local batch size
:param seeds: list of seeds, one seed for each training epoch
:param shard_size: number of global batches within one shard
:param world_size: number of distributed workers
:param rank: rank of the current process
"""
super().__init__(dataset, batch_size, seeds, world_size, rank)
self.shard_size = shard_size
self.num_samples = self.data_len // self.global_batch_size \
* self.global_batch_size
def __iter__(self):
rng = self.init_rng()
# generate permutation
indices = torch.randperm(self.data_len, generator=rng)
# make indices evenly divisible by (batch_size * world_size)
indices = indices[:self.num_samples]
# splits the dataset into chunks of 'self.shard_size' global batches
# each, sorts by (src + tgt) sequence length within each chunk,
# reshuffles all global batches
shard_size = self.global_batch_size * self.shard_size
nshards = (self.num_samples + shard_size - 1) // shard_size
lengths = self.dataset.lengths[indices]
shards = [indices[i * shard_size:(i+1) * shard_size] for i in range(nshards)]
len_shards = [lengths[i * shard_size:(i+1) * shard_size] for i in range(nshards)]
# sort by (src + tgt) sequence length within each shard
indices = []
for len_shard in len_shards:
_, ind = len_shard.sort()
indices.append(ind)
output = tuple(shard[idx] for shard, idx in zip(shards, indices))
# build batches
indices = torch.cat(output)
# perform global reshuffle of all global batches
indices = self.reshuffle_batches(indices, rng)
# distribute batches to individual workers
indices = self.distribute_batches(indices)
return iter(indices)
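A toy, self-contained sketch of the sharding idea above (shard a shuffled index list, sort each shard by sequence length, then concatenate); the sizes and tensors here are made up for illustration:
import torch
lengths = torch.tensor([7, 2, 9, 4, 6, 3, 8, 5])   # per-sample (src + tgt) lengths
perm = torch.randperm(8)
shard_size = 4                                      # e.g. 2 global batches of 2 samples
shards = [perm[i:i + shard_size] for i in range(0, 8, shard_size)]
sorted_shards = []
for shard in shards:
    _, order = lengths[shard].sort()                # sort within the shard only
    sorted_shards.append(shard[order])
indices = torch.cat(sorted_shards)                  # batches now group similar lengths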
class BucketingSampler(DistributedSampler):
def __init__(self, dataset, batch_size, seeds, num_buckets,
world_size=None, rank=None):
"""
Constructor for the BucketingSampler.
:param dataset: dataset
:param batch_size: local batch size
:param seeds: list of seeds, one seed for each training epoch
:param num_buckets: number of buckets
:param world_size: number of distributed workers
:param rank: rank of the current process
"""
super().__init__(dataset, batch_size, seeds, world_size, rank)
self.num_buckets = num_buckets
bucket_width = (dataset.max_len + num_buckets - 1) // num_buckets
# assign sentences to buckets based on src and tgt sequence lengths
bucket_ids = torch.max(dataset.src_lengths // bucket_width,
dataset.tgt_lengths // bucket_width)
bucket_ids.clamp_(0, num_buckets - 1)
# build buckets
all_indices = torch.tensor(range(self.data_len))
self.buckets = []
self.num_samples = 0
global_bs = self.global_batch_size
for bid in range(num_buckets):
# gather indices for current bucket
indices = all_indices[bucket_ids == bid]
self.buckets.append(indices)
# count number of samples in current bucket
samples = len(indices) // global_bs * global_bs
self.num_samples += samples
def __iter__(self):
rng = self.init_rng()
global_bs = self.global_batch_size
indices = []
for bid in range(self.num_buckets):
# random shuffle within current bucket
perm = torch.randperm(len(self.buckets[bid]), generator=rng)
bucket_indices = self.buckets[bid][perm]
# make bucket_indices evenly divisible by global batch size
length = len(bucket_indices) // global_bs * global_bs
bucket_indices = bucket_indices[:length]
assert len(bucket_indices) % self.global_batch_size == 0
# add samples from current bucket to indices for current epoch
indices.append(bucket_indices)
indices = torch.cat(indices)
assert len(indices) % self.global_batch_size == 0
# perform global reshuffle of all global batches
indices = self.reshuffle_batches(indices, rng)
# distribute batches to individual workers
indices = self.distribute_batches(indices)
return iter(indices)
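A quick numeric illustration of the bucket assignment used above, with made-up lengths, max_len=50 and num_buckets=5:
import torch
max_len, num_buckets = 50, 5
bucket_width = (max_len + num_buckets - 1) // num_buckets        # 10
src_lengths = torch.tensor([3, 12, 27, 49])
tgt_lengths = torch.tensor([8, 15, 22, 44])
bucket_ids = torch.max(src_lengths // bucket_width,
                       tgt_lengths // bucket_width).clamp(0, num_buckets - 1)
print(bucket_ids.tolist())                                       # [0, 1, 2, 4]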
class StaticDistributedSampler(Sampler):
def __init__(self, dataset, batch_size, pad, world_size=None, rank=None):
"""
Constructor for the StaticDistributedSampler.
:param dataset: dataset
:param batch_size: local batch size
:param pad: if True: pads dataset to a multiple of global_batch_size
samples
:param world_size: number of distributed workers
:param rank: rank of the current process
"""
if world_size is None:
world_size = get_world_size()
if rank is None:
rank = get_rank()
self.world_size = world_size
global_batch_size = batch_size * world_size
data_len = len(dataset)
num_samples = (data_len + global_batch_size - 1) \
// global_batch_size * global_batch_size
self.num_samples = num_samples
indices = list(range(data_len))
if pad:
# pad dataset to a multiple of global_batch_size samples, uses
# sample with idx 0 as pad
indices += [0] * (num_samples - len(indices))
else:
# temporary pad to a multiple of global batch size, pads with "-1"
# which is later removed from the list of indices
indices += [-1] * (num_samples - len(indices))
indices = torch.tensor(indices)
indices = indices.view(-1, batch_size)
indices = indices[rank::world_size].contiguous()
indices = indices.view(-1)
# remove temporary pad
indices = indices[indices != -1]
indices = indices.tolist()
self.indices = indices
def __iter__(self):
return iter(self.indices)
def __len__(self):
return len(self.indices)
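A worked example of the padding behaviour, assuming a hypothetical dataset of 5 samples, batch_size=2 and world_size=2 (num_samples is rounded up to 8 and sample 0 is reused as padding):
import torch
data_len, batch_size, world_size = 5, 2, 2
global_bs = batch_size * world_size
num_samples = (data_len + global_bs - 1) // global_bs * global_bs   # 8
indices = list(range(data_len)) + [0] * (num_samples - data_len)    # pad with sample 0
indices = torch.tensor(indices).view(-1, batch_size)
print(indices[0::world_size].reshape(-1).tolist())                  # rank 0: [0, 1, 4, 0]
print(indices[1::world_size].reshape(-1).tolist())                  # rank 1: [2, 3, 0, 0]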

View file

@ -1,43 +1,76 @@
import logging
from collections import defaultdict
from functools import partial
import seq2seq.data.config as config
def default():
return config.UNK
class Tokenizer:
"""
Tokenizer class.
"""
def __init__(self, vocab_fname, separator='@@'):
def __init__(self, vocab_fname=None, pad=1, separator='@@'):
"""
Constructor for the Tokenizer class.
:param vocab_fname: path to the file with vocabulary
:param pad: pads vocabulary to a multiple of 'pad' tokens
:param separator: tokenization separator
"""
self.separator = separator
if vocab_fname:
self.separator = separator
logging.info(f'Building vocabulary from {vocab_fname}')
vocab = [config.PAD_TOKEN, config.UNK_TOKEN,
config.BOS_TOKEN, config.EOS_TOKEN]
logging.info(f'Building vocabulary from {vocab_fname}')
vocab = [config.PAD_TOKEN, config.UNK_TOKEN,
config.BOS_TOKEN, config.EOS_TOKEN]
with open(vocab_fname) as vfile:
for line in vfile:
vocab.append(line.strip())
with open(vocab_fname) as vfile:
for line in vfile:
vocab.append(line.strip())
logging.info(f'Size of vocabulary: {len(vocab)}')
self.vocab_size = len(vocab)
self.pad_vocabulary(vocab, pad)
self.vocab_size = len(vocab)
logging.info(f'Size of vocabulary: {self.vocab_size}')
self.tok2idx = defaultdict(default)
for idx, token in enumerate(vocab):
self.tok2idx[token] = idx
self.tok2idx = defaultdict(partial(int, config.UNK))
for idx, token in enumerate(vocab):
self.tok2idx[token] = idx
self.idx2tok = {}
for key, value in self.tok2idx.items():
self.idx2tok[value] = key
self.idx2tok = {}
for key, value in self.tok2idx.items():
self.idx2tok[value] = key
def pad_vocabulary(self, vocab, pad):
"""
Pads vocabulary to a multiple of 'pad' tokens.
:param vocab: list with vocabulary
:param pad: integer
"""
vocab_size = len(vocab)
padded_vocab_size = (vocab_size + pad - 1) // pad * pad
for i in range(0, padded_vocab_size - vocab_size):
token = f'madeupword{i:04d}'
vocab.append(token)
assert len(vocab) % pad == 0
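For example, a hypothetical vocabulary of 31,794 BPE tokens padded to a multiple of 8 gains six 'madeupwordNNNN' filler entries:
vocab_size, pad = 31794, 8
padded_vocab_size = (vocab_size + pad - 1) // pad * pad
print(padded_vocab_size, padded_vocab_size - vocab_size)   # 31800 6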
def get_state(self):
logging.info(f'Saving state of the tokenizer')
state = {
'separator': self.separator,
'vocab_size': self.vocab_size,
'tok2idx': self.tok2idx,
'idx2tok': self.idx2tok,
}
return state
def set_state(self, state):
logging.info(f'Restoring state of the tokenizer')
self.separator = state['separator']
self.vocab_size = state['vocab_size']
self.tok2idx = state['tok2idx']
self.idx2tok = state['idx2tok']
def segment(self, line):
"""
@ -62,6 +95,11 @@ class Tokenizer:
returns: string representing detokenized sentence
"""
detok = delim.join([self.idx2tok[idx] for idx in inputs])
detok = detok.replace(
self.separator+ ' ', '').replace(self.separator, '')
detok = detok.replace(self.separator + ' ', '')
detok = detok.replace(self.separator, '')
detok = detok.replace(config.BOS_TOKEN, '')
detok = detok.replace(config.EOS_TOKEN, '')
detok = detok.replace(config.PAD_TOKEN, '')
detok = detok.strip()
return detok
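A toy illustration of the subword merging done above, assuming the default '@@' separator and '<s>'/'</s>'-style BOS/EOS markers (the concrete marker strings live in seq2seq.data.config and are only assumed here):
detok = ' '.join(['<s>', 'new@@', 'comer', '</s>'])
detok = detok.replace('@@ ', '').replace('@@', '')
detok = detok.replace('<s>', '').replace('</s>', '').strip()
print(detok)   # 'newcomer'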

View file

@ -81,7 +81,8 @@ class SequenceGenerator:
counter += 1
words = words.view(word_view)
words, logprobs, attn, context = self.model.generate(words, context, 1)
output = self.model.generate(words, context, 1)
words, logprobs, attn, context = output
words = words.view(-1)
translation[active, idx] = words
@ -123,13 +124,15 @@ class SequenceGenerator:
max_seq_len = self.max_seq_len
cov_penalty_factor = self.cov_penalty_factor
translation = torch.zeros(batch_size * beam_size, max_seq_len, dtype=torch.int64)
translation = torch.zeros(batch_size * beam_size, max_seq_len,
dtype=torch.int64)
lengths = torch.ones(batch_size * beam_size, dtype=torch.int64)
scores = torch.zeros(batch_size * beam_size, dtype=torch.float32)
active = torch.arange(0, batch_size * beam_size, dtype=torch.int64)
base_mask = torch.arange(0, batch_size * beam_size, dtype=torch.int64)
global_offset = torch.arange(0, batch_size * beam_size, beam_size, dtype=torch.int64)
global_offset = torch.arange(0, batch_size * beam_size, beam_size,
dtype=torch.int64)
eos_beam_fill = torch.tensor([0] + (beam_size - 1) * [float('-inf')])
@ -161,21 +164,23 @@ class SequenceGenerator:
_, seq, feature = context[0].shape
context[0].unsqueeze_(1)
context[0] = context[0].expand(-1, beam_size, -1, -1)
context[0] = context[0].contiguous().view(batch_size * beam_size, seq, feature)
context[0] = context[0].contiguous().view(batch_size * beam_size,
seq, feature)
# context[0]: (batch * beam, seq, feature)
else:
# context[0] (encoder state): (seq, batch, feature)
seq, _, feature = context[0].shape
context[0].unsqueeze_(2)
context[0] = context[0].expand(-1, -1, beam_size, -1)
context[0] = context[0].contiguous().view(seq, batch_size * beam_size, feature)
context[0] = context[0].contiguous().view(seq, batch_size *
beam_size, feature)
# context[0]: (seq, batch * beam, feature)
#context[1] (encoder seq length): (batch)
# context[1] (encoder seq length): (batch)
context[1].unsqueeze_(1)
context[1] = context[1].expand(-1, beam_size)
context[1] = context[1].contiguous().view(batch_size * beam_size)
#context[1]: (batch * beam)
# context[1]: (batch * beam)
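A toy shape check of the beam expansion applied to the encoder state above, with made-up sizes seq=7, batch=2, beam=5, feature=16:
import torch
enc = torch.zeros(7, 2, 16)                  # (seq, batch, feature)
enc = enc.unsqueeze(2).expand(-1, -1, 5, -1)
enc = enc.contiguous().view(7, 2 * 5, 16)    # (seq, batch * beam, feature)
print(enc.shape)                             # torch.Size([7, 10, 16])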
accu_attn_scores = torch.zeros(batch_size * beam_size, seq)
if self.cuda:
@ -194,7 +199,8 @@ class SequenceGenerator:
lengths[active[~eos_mask.view(-1)]] += 1
words, logprobs, attn, context = self.model.generate(words, context, beam_size)
output = self.model.generate(words, context, beam_size)
words, logprobs, attn, context = output
attn = attn.float().squeeze(attn_query_dim)
attn = attn.masked_fill(eos_mask.view(-1).unsqueeze(1), 0)

View file

@ -1,7 +1,8 @@
import contextlib
import logging
import os
import subprocess
import time
import os
import torch
import torch.distributed as dist
@ -9,7 +10,18 @@ import torch.distributed as dist
import seq2seq.data.config as config
from seq2seq.inference.beam_search import SequenceGenerator
from seq2seq.utils import AverageMeter
from seq2seq.utils import get_rank, get_world_size
from seq2seq.utils import barrier
from seq2seq.utils import get_rank
from seq2seq.utils import get_world_size
def gather_predictions(preds):
world_size = get_world_size()
if world_size > 1:
all_preds = [preds.new(preds.size(0), preds.size(1)) for i in range(world_size)]
dist.all_gather(all_preds, preds)
preds = torch.cat(all_preds)
return preds
class Translator:
@ -96,17 +108,25 @@ class Translator:
eval_path = self.build_eval_path(epoch, iteration)
detok_eval_path = eval_path + '.detok'
rank = get_rank()
if rank == 0:
logging.info(f'Running evaluation on test set')
self.model.eval()
torch.cuda.empty_cache()
with contextlib.suppress(FileNotFoundError):
os.remove(eval_path)
os.remove(detok_eval_path)
self.evaluate(epoch, iteration, eval_path, summary)
rank = get_rank()
logging.info(f'Running evaluation on test set')
self.model.eval()
torch.cuda.empty_cache()
output = self.evaluate(epoch, iteration, summary)
output = output[:len(self.loader.dataset)]
output = self.loader.dataset.unsort(output)
if rank == 0:
with open(eval_path, 'a') as eval_file:
eval_file.writelines(output)
if calc_bleu:
self.run_detokenizer(eval_path)
test_bleu[0] = self.run_sacrebleu(detok_eval_path,
reference_path)
test_bleu[0] = self.run_sacrebleu(detok_eval_path, reference_path)
if summary:
logging.info(f'BLEU on test dataset: {test_bleu[0]:.2f}')
@ -114,8 +134,9 @@ class Translator:
logging.info(f'Target accuracy reached')
break_training[0] = 1
torch.cuda.empty_cache()
logging.info(f'Finished evaluation on test set')
barrier()
torch.cuda.empty_cache()
logging.info(f'Finished evaluation on test set')
if self.distributed:
dist.broadcast(break_training, 0)
@ -123,35 +144,29 @@ class Translator:
return test_bleu[0].item(), break_training[0].item()
def evaluate(self, epoch, iteration, eval_path, summary):
def evaluate(self, epoch, iteration, summary):
"""
Runs evaluation on test dataset.
:param epoch: index of the current epoch
:param iteration: index of the current iteration
:param eval_path: path to the file for saving results
:param summary: if True prints summary
"""
eval_file = open(eval_path, 'w')
batch_time = AverageMeter(False)
tot_tok_per_sec = AverageMeter(False)
iterations = AverageMeter(False)
enc_seq_len = AverageMeter(False)
dec_seq_len = AverageMeter(False)
total_iters = 0
total_lines = 0
stats = {}
output = []
for i, (src, indices) in enumerate(self.loader):
translate_timer = time.time()
src, src_length = src
if self.batch_first:
batch_size = src.size(0)
else:
batch_size = src.size(1)
total_lines += batch_size
batch_size = self.loader.batch_size
global_batch_size = batch_size * get_world_size()
beam_size = self.beam_size
bos = [self.insert_target_start] * (batch_size * beam_size)
@ -179,20 +194,18 @@ class Translator:
generator = self.generator.beam_search
preds, lengths, counter = generator(batch_size, bos, context)
preds = preds.cpu()
lengths = lengths.cpu()
stats['total_dec_len'] = int(lengths.sum())
stats['total_dec_len'] = lengths.sum().item()
stats['iters'] = counter
total_iters += stats['iters']
output = []
for idx, pred in enumerate(preds):
end = lengths[idx] - 1
pred = pred[1:end].tolist()
out = self.tokenizer.detokenize(pred)
output.append(out)
indices = torch.tensor(indices).to(preds)
preds = preds.scatter(0, indices.unsqueeze(1).expand_as(preds), preds)
output = [output[indices.index(i)] for i in range(len(output))]
preds = gather_predictions(preds).cpu()
for pred in preds:
pred = pred.tolist()
detok = self.tokenizer.detokenize(pred)
output.append(detok + '\n')
elapsed = time.time() - translate_timer
batch_time.update(elapsed, batch_size)
@ -219,25 +232,28 @@ class Translator:
log = ''.join(log)
logging.info(log)
for line in output:
eval_file.write(line)
eval_file.write('\n')
tot_tok_per_sec.reduce('sum')
enc_seq_len.reduce('mean')
dec_seq_len.reduce('mean')
batch_time.reduce('mean')
iterations.reduce('sum')
eval_file.close()
if summary:
time_per_sentence = (batch_time.avg / self.loader.batch_size)
if summary and get_rank() == 0:
time_per_sentence = (batch_time.avg / global_batch_size)
log = []
log += f'TEST SUMMARY:\n'
log += f'Lines translated: {total_lines}\t'
log += f'Lines translated: {len(self.loader.dataset)}\t'
log += f'Avg total tokens/s: {tot_tok_per_sec.avg:.0f}\n'
log += f'Avg time per batch: {batch_time.avg:.3f} s\t'
log += f'Avg time per sentence: {1000*time_per_sentence:.3f} ms\n'
log += f'Avg encoder seq len: {enc_seq_len.avg:.2f}\t'
log += f'Avg decoder seq len: {dec_seq_len.avg:.2f}\t'
log += f'Total decoder iterations: {total_iters}'
log += f'Total decoder iterations: {int(iterations.sum)}'
log = ''.join(log)
logging.info(log)
return output
def run_detokenizer(self, eval_path):
"""
Executes moses detokenizer on eval_path file and saves result to

View file

@ -12,7 +12,7 @@ class BahdanauAttention(nn.Module):
Implementation is very similar to tf.contrib.seq2seq.BahdanauAttention
"""
def __init__(self, query_size, key_size, num_units, normalize=False,
dropout=0, batch_first=False):
batch_first=False, init_weight=0.1):
"""
Constructor for the BahdanauAttention.
@ -20,9 +20,10 @@ class BahdanauAttention(nn.Module):
:param key_size: feature dimension for keys
:param num_units: internal feature dimension
:param normalize: whether to normalize energy term
:param dropout: probability of the dropout (between softmax and bmm)
:param batch_first: if True batch size is the 1st dimension, if False
the sequence is first and batch size is second
:param init_weight: range for uniform initializer used to initialize
Linear key and query transform layers and linear_att vector
"""
super(BahdanauAttention, self).__init__()
@ -32,10 +33,11 @@ class BahdanauAttention(nn.Module):
self.linear_q = nn.Linear(query_size, num_units, bias=False)
self.linear_k = nn.Linear(key_size, num_units, bias=False)
nn.init.uniform_(self.linear_q.weight.data, -init_weight, init_weight)
nn.init.uniform_(self.linear_k.weight.data, -init_weight, init_weight)
self.linear_att = Parameter(torch.Tensor(num_units))
self.dropout = nn.Dropout(dropout)
self.mask = None
if self.normalize:
@ -45,14 +47,14 @@ class BahdanauAttention(nn.Module):
self.register_parameter('normalize_scalar', None)
self.register_parameter('normalize_bias', None)
self.reset_parameters()
self.reset_parameters(init_weight)
def reset_parameters(self):
def reset_parameters(self, init_weight):
"""
Sets initial random values for trainable parameters.
"""
stdv = 1. / math.sqrt(self.num_units)
self.linear_att.data.uniform_(-stdv, stdv)
self.linear_att.data.uniform_(-init_weight, init_weight)
if self.normalize:
self.normalize_scalar.data.fill_(stdv)
@ -74,7 +76,8 @@ class BahdanauAttention(nn.Module):
else:
max_len = context.size(0)
indices = torch.arange(0, max_len, dtype=torch.int64, device=context.device)
indices = torch.arange(0, max_len, dtype=torch.int64,
device=context.device)
self.mask = indices >= (context_len.unsqueeze(1))
def calc_score(self, att_query, att_keys):
@ -96,16 +99,12 @@ class BahdanauAttention(nn.Module):
if self.normalize:
sum_qk = sum_qk + self.normalize_bias
tmp = self.linear_att.to(torch.float32)
linear_att = tmp / tmp.norm()
linear_att = linear_att.to(self.normalize_scalar)
linear_att = self.linear_att / self.linear_att.norm()
linear_att = linear_att * self.normalize_scalar
else:
linear_att = self.linear_att
out = F.tanh(sum_qk).matmul(linear_att)
out = torch.tanh(sum_qk).matmul(linear_att)
return out
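A shape-only sketch of the normalized score computed above, reducing query/key sums to attention scores of shape (batch, t_q, t_k); all sizes and tensors below are toy stand-ins, not the model's parameters:
import torch
b, t_q, t_k, n = 2, 3, 4, 8
sum_qk = torch.randn(b, t_q, 1, n) + torch.randn(b, 1, t_k, n)
v = torch.randn(n)                            # stands in for linear_att
scalar = torch.tensor(1.0 / n ** 0.5)         # stands in for normalize_scalar
scores = torch.tanh(sum_qk).matmul(v / v.norm() * scalar)
print(scores.shape)                           # torch.Size([2, 3, 4])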
def forward(self, query, keys):
@ -152,7 +151,6 @@ class BahdanauAttention(nn.Module):
# Calculate the weighted average of the attention inputs according to
# the scores
scores_normalized = self.dropout(scores_normalized)
# context: (b x t_q x n)
context = torch.bmm(scores_normalized, keys)

View file

@ -3,16 +3,18 @@ import itertools
import torch
import torch.nn as nn
from seq2seq.models.attention import BahdanauAttention
import seq2seq.data.config as config
from seq2seq.models.attention import BahdanauAttention
from seq2seq.utils import init_lstm_
class RecurrentAttention(nn.Module):
"""
LSTM with an attention module.
LSTM wrapped with an attention module.
"""
def __init__(self, input_size, context_size, hidden_size, num_layers=1,
bias=True, batch_first=False, dropout=0):
def __init__(self, input_size=1024, context_size=1024, hidden_size=1024,
num_layers=1, batch_first=False, dropout=0.2,
init_weight=0.1):
"""
Constructor for the RecurrentAttention.
@ -20,16 +22,17 @@ class RecurrentAttention(nn.Module):
:param context_size: number of features in output from encoder
:param hidden_size: internal hidden size
:param num_layers: number of layers in LSTM
:param bias: enables bias in LSTM layers
:param batch_first: if True the model uses (batch,seq,feature) tensors,
if false the model uses (seq, batch, feature)
:param dropout: probability of dropout
:param dropout: probability of dropout (on input to LSTM layer)
:param init_weight: range for the uniform initializer
"""
super(RecurrentAttention, self).__init__()
self.rnn = nn.LSTM(input_size, hidden_size, num_layers, bias,
batch_first)
self.rnn = nn.LSTM(input_size, hidden_size, num_layers, bias=True,
batch_first=batch_first)
init_lstm_(self.rnn, init_weight)
self.attn = BahdanauAttention(hidden_size, context_size, context_size,
normalize=True, batch_first=batch_first)
@ -52,9 +55,9 @@ class RecurrentAttention(nn.Module):
# softmax
self.attn.set_mask(context_len, context)
inputs = self.dropout(inputs)
rnn_outputs, hidden = self.rnn(inputs, hidden)
attn_outputs, scores = self.attn(rnn_outputs, context)
rnn_outputs = self.dropout(rnn_outputs)
return rnn_outputs, hidden, attn_outputs, scores
@ -63,23 +66,18 @@ class Classifier(nn.Module):
"""
Fully-connected classifier
"""
def __init__(self, in_features, out_features, math='fp32'):
def __init__(self, in_features, out_features, init_weight=0.1):
"""
Constructor for the Classifier.
:param in_features: number of input features
:param out_features: number of output features (size of vocabulary)
:param math: arithmetic type, 'fp32' or 'fp16'
:param init_weight: range for the uniform initializer
"""
super(Classifier, self).__init__()
self.out_features = out_features
# padding required to trigger HMMA kernels
if math == 'fp16':
out_features = (out_features + 7) // 8 * 8
self.classifier = nn.Linear(in_features, out_features)
nn.init.uniform_(self.classifier.weight.data, -init_weight, init_weight)
nn.init.uniform_(self.classifier.bias.data, -init_weight, init_weight)
def forward(self, x):
"""
@ -88,7 +86,6 @@ class Classifier(nn.Module):
:param x: output from decoder
"""
out = self.classifier(x)
out = out[..., :self.out_features]
return out
@ -102,22 +99,24 @@ class ResidualRecurrentDecoder(nn.Module):
LSTM layer of the decoder goes into the attention module, then the
re-weighted context is concatenated with inputs to all subsequent LSTM
layers in the decoder at the current timestep.
Residual connections are enabled after 3rd LSTM layer, dropout is applied
on inputs to LSTM layers.
"""
def __init__(self, vocab_size, hidden_size=128, num_layers=8, bias=True,
dropout=0, batch_first=False, math='fp32', embedder=None):
def __init__(self, vocab_size, hidden_size=1024, num_layers=4, dropout=0.2,
batch_first=False, embedder=None, init_weight=0.1):
"""
Constructor of the ResidualRecurrentDecoder.
:param vocab_size: size of vocabulary
:param hidden_size: hidden size for LSTM layers
:param num_layers: number of LSTM layers
:param bias: enables bias in LSTM layers
:param dropout: probability of dropout (between LSTM layers)
:param dropout: probability of dropout (on input to LSTM layers)
:param batch_first: if True the model uses (batch,seq,feature) tensors,
if false the model uses (seq, batch, feature)
:param math: arithmetic type, 'fp32' or 'fp16'
:param embedder: embedding module, if None constructor will create new
embedding layer
:param embedder: instance of nn.Embedding, if None constructor will
create new embedding layer
:param init_weight: range for the uniform initializer
"""
super(ResidualRecurrentDecoder, self).__init__()
@ -125,21 +124,26 @@ class ResidualRecurrentDecoder(nn.Module):
self.att_rnn = RecurrentAttention(hidden_size, hidden_size,
hidden_size, num_layers=1,
batch_first=batch_first)
batch_first=batch_first,
dropout=dropout)
self.rnn_layers = nn.ModuleList()
for _ in range(num_layers - 1):
self.rnn_layers.append(
nn.LSTM(2 * hidden_size, hidden_size, num_layers=1, bias=bias,
nn.LSTM(2 * hidden_size, hidden_size, num_layers=1, bias=True,
batch_first=batch_first))
for lstm in self.rnn_layers:
init_lstm_(lstm, init_weight)
if embedder is not None:
self.embedder = embedder
else:
self.embedder = nn.Embedding(vocab_size, hidden_size,
padding_idx=config.PAD)
nn.init.uniform_(embedder.weight.data, -init_weight, init_weight)
self.classifier = Classifier(hidden_size, vocab_size, math)
self.classifier = Classifier(hidden_size, vocab_size)
self.dropout = nn.Dropout(p=dropout)
def init_hidden(self, hidden):
@ -199,15 +203,15 @@ class ResidualRecurrentDecoder(nn.Module):
x, h, attn, scores = self.att_rnn(x, hidden[0], enc_context, enc_len)
self.append_hidden(h)
x = self.dropout(x)
x = torch.cat((x, attn), dim=2)
x = self.dropout(x)
x, h = self.rnn_layers[0](x, hidden[1])
self.append_hidden(h)
for i in range(1, len(self.rnn_layers)):
residual = x
x = self.dropout(x)
x = torch.cat((x, attn), dim=2)
x = self.dropout(x)
x, h = self.rnn_layers[i](x, hidden[i + 1])
self.append_hidden(h)
x = x + residual

View file

@ -3,6 +3,7 @@ from torch.nn.utils.rnn import pack_padded_sequence
from torch.nn.utils.rnn import pad_packed_sequence
import seq2seq.data.config as config
from seq2seq.utils import init_lstm_
class ResidualRecurrentEncoder(nn.Module):
@ -10,42 +11,48 @@ class ResidualRecurrentEncoder(nn.Module):
Encoder with Embedding, LSTM layers, residual connections and optional
dropout.
The first LSTM layer is bidirectional and uses variable sequence length API,
the remaining (num_layers-1) layers are unidirectional. Residual
connections are enabled after third LSTM layer, dropout is applied between
LSTM layers.
The first LSTM layer is bidirectional and uses variable sequence length
API, the remaining (num_layers-1) layers are unidirectional. Residual
connections are enabled after third LSTM layer, dropout is applied on
inputs to LSTM layers.
"""
def __init__(self, vocab_size, hidden_size=128, num_layers=8, bias=True,
dropout=0, batch_first=False, embedder=None):
def __init__(self, vocab_size, hidden_size=1024, num_layers=4, dropout=0.2,
batch_first=False, embedder=None, init_weight=0.1):
"""
Constructor for the ResidualRecurrentEncoder.
:param vocab_size: size of vocabulary
:param hidden_size: hidden size for LSTM layers
:param num_layers: number of LSTM layers, 1st layer is bidirectional
:param bias: enables bias in LSTM layers
:param dropout: probability of dropout (between LSTM layers)
:param dropout: probability of dropout (on input to LSTM layers)
:param batch_first: if True the model uses (batch,seq,feature) tensors,
if false the model uses (seq, batch, feature)
:param embedder: embedding module, if None constructor will create new
embedding layer
:param embedder: instance of nn.Embedding, if None constructor will
create new embedding layer
:param init_weight: range for the uniform initializer
"""
super(ResidualRecurrentEncoder, self).__init__()
self.batch_first = batch_first
self.rnn_layers = nn.ModuleList()
# 1st LSTM layer, bidirectional
self.rnn_layers.append(
nn.LSTM(hidden_size, hidden_size, num_layers=1, bias=bias,
nn.LSTM(hidden_size, hidden_size, num_layers=1, bias=True,
batch_first=batch_first, bidirectional=True))
# 2nd LSTM layer, with 2x larger input_size
self.rnn_layers.append(
nn.LSTM((2 * hidden_size), hidden_size, num_layers=1, bias=bias,
nn.LSTM((2 * hidden_size), hidden_size, num_layers=1, bias=True,
batch_first=batch_first))
# Remaining LSTM layers
for _ in range(num_layers - 2):
self.rnn_layers.append(
nn.LSTM(hidden_size, hidden_size, num_layers=1, bias=bias,
nn.LSTM(hidden_size, hidden_size, num_layers=1, bias=True,
batch_first=batch_first))
for lstm in self.rnn_layers:
init_lstm_(lstm, init_weight)
self.dropout = nn.Dropout(p=dropout)
if embedder is not None:
@ -53,6 +60,7 @@ class ResidualRecurrentEncoder(nn.Module):
else:
self.embedder = nn.Embedding(vocab_size, hidden_size,
padding_idx=config.PAD)
nn.init.uniform_(embedder.weight.data, -init_weight, init_weight)
def forward(self, inputs, lengths):
"""
@ -66,6 +74,7 @@ class ResidualRecurrentEncoder(nn.Module):
x = self.embedder(inputs)
# bidirectional layer
x = self.dropout(x)
x = pack_padded_sequence(x, lengths.cpu().numpy(),
batch_first=self.batch_first)
x, _ = self.rnn_layers[0](x)

View file

@ -1,18 +1,17 @@
import torch.nn as nn
import seq2seq.data.config as config
from seq2seq.models.seq2seq_base import Seq2Seq
from seq2seq.models.encoder import ResidualRecurrentEncoder
from seq2seq.models.decoder import ResidualRecurrentDecoder
from seq2seq.models.encoder import ResidualRecurrentEncoder
from seq2seq.models.seq2seq_base import Seq2Seq
class GNMT(Seq2Seq):
"""
GNMT v2 model
"""
def __init__(self, vocab_size, hidden_size=512, num_layers=8, bias=True,
dropout=0.2, batch_first=False, math='fp32',
share_embedding=False):
def __init__(self, vocab_size, hidden_size=1024, num_layers=4, dropout=0.2,
batch_first=False, share_embedding=True):
"""
Constructor for the GNMT v2 model.
@ -20,11 +19,9 @@ class GNMT(Seq2Seq):
:param hidden_size: internal hidden size of the model
:param num_layers: number of layers, applies to both encoder and
decoder
:param bias: globally enables or disables bias in encoder and decoder
:param dropout: probability of dropout (in encoder and decoder)
:param batch_first: if True the model uses (batch,seq,feature) tensors,
if false the model uses (seq, batch, feature)
:param math: arithmetic type, 'fp32' or 'fp16'
:param share_embedding: if True embeddings are shared between encoder
and decoder
"""
@ -32,17 +29,19 @@ class GNMT(Seq2Seq):
super(GNMT, self).__init__(batch_first=batch_first)
if share_embedding:
embedder = nn.Embedding(vocab_size, hidden_size, padding_idx=config.PAD)
embedder = nn.Embedding(vocab_size, hidden_size,
padding_idx=config.PAD)
nn.init.uniform_(embedder.weight.data, -0.1, 0.1)
else:
embedder = None
self.encoder = ResidualRecurrentEncoder(vocab_size, hidden_size,
num_layers, bias, dropout,
num_layers, dropout,
batch_first, embedder)
self.decoder = ResidualRecurrentDecoder(vocab_size, hidden_size,
num_layers, bias, dropout,
batch_first, math, embedder)
num_layers, dropout,
batch_first, embedder)
def forward(self, input_encoder, input_enc_len, input_decoder):
context = self.encode(input_encoder, input_enc_len)

View file

@ -1,222 +0,0 @@
import torch
from torch._utils import _flatten_dense_tensors, _unflatten_dense_tensors
import torch.distributed as dist
from torch.nn.modules import Module
from torch.autograd import Variable
from collections import OrderedDict
def flat_dist_call(tensors, call, extra_args=None):
flat_dist_call.warn_on_half = True
buckets = OrderedDict()
for tensor in tensors:
tp = tensor.type()
if tp not in buckets:
buckets[tp] = []
buckets[tp].append(tensor)
if flat_dist_call.warn_on_half:
if torch.cuda.HalfTensor in buckets:
print("WARNING: gloo dist backend for half parameters may be extremely slow." +
" It is recommended to use the NCCL backend in this case.")
flat_dist_call.warn_on_half = False
for tp in buckets:
bucket = buckets[tp]
coalesced = _flatten_dense_tensors(bucket)
if extra_args is not None:
call(coalesced, *extra_args)
else:
call(coalesced)
if call is dist.all_reduce:
coalesced /= dist.get_world_size()
for buf, synced in zip(bucket, _unflatten_dense_tensors(coalesced, bucket)):
buf.copy_(synced)
class DistributedDataParallel(Module):
"""
:class:`apex.parallel.DistributedDataParallel` is a module wrapper that enables
easy multiprocess distributed data parallel training, similar to ``torch.nn.parallel.DistributedDataParallel``.
:class:`DistributedDataParallel` is designed to work with
the launch utility script ``apex.parallel.multiproc.py``.
When used with ``multiproc.py``, :class:`DistributedDataParallel`
assigns 1 process to each of the available (visible) GPUs on the node.
Parameters are broadcast across participating processes on initialization, and gradients are
allreduced and averaged over processes during ``backward()``.
:class:`DistributedDataParallel` is optimized for use with NCCL. It achieves high performance by
overlapping communication with computation during ``backward()`` and bucketing smaller gradient
transfers to reduce the total number of transfers required.
:class:`DistributedDataParallel` assumes that your script accepts the command line
arguments "rank" and "world-size." It also assumes that your script calls
``torch.cuda.set_device(args.rank)`` before creating the model.
https://github.com/NVIDIA/apex/tree/master/examples/distributed shows detailed usage.
https://github.com/NVIDIA/apex/tree/master/examples/imagenet shows another example
that combines :class:`DistributedDataParallel` with mixed precision training.
Args:
module: Network definition to be run in multi-gpu/distributed mode.
message_size (Default = 1e7): Minimum number of elements in a communication bucket.
shared_param (Default = False): If your model uses shared parameters this must be True. It will disable bucketing of parameters to avoid race conditions.
"""
def __init__(self, module, message_size=10000000, shared_param=False):
super(DistributedDataParallel, self).__init__()
self.warn_on_half = True if dist._backend == dist.dist_backend.GLOO else False
self.shared_param = shared_param
self.message_size = message_size
#reference to last iterations parameters to see if anything has changed
self.param_refs = []
self.reduction_stream = torch.cuda.Stream()
self.module = module
self.param_list = list(self.module.parameters())
if dist._backend == dist.dist_backend.NCCL:
for param in self.param_list:
assert param.is_cuda, "NCCL backend only supports model parameters to be on GPU."
self.record = []
self.create_hooks()
flat_dist_call([param.data for param in self.module.parameters()], dist.broadcast, (0,) )
def create_hooks(self):
#all reduce gradient hook
def allreduce_params():
if not self.needs_reduction:
return
self.needs_reduction = False
#parameter ordering refresh
if self.needs_refresh and not self.shared_param:
t_record = torch.cuda.IntTensor(self.record)
dist.broadcast(t_record, 0)
self.record = [int(entry) for entry in t_record]
self.needs_refresh = False
grads = [param.grad.data for param in self.module.parameters() if param.grad is not None]
flat_dist_call(grads, dist.all_reduce)
def flush_buckets():
if not self.needs_reduction:
return
self.needs_reduction = False
grads = []
for i in range(self.ready_end, len(self.param_state)):
param = self.param_refs[self.record[i]]
if param.grad is not None:
grads.append(param.grad.data)
grads = [param.grad.data for param in self.ready_params] + grads
if(len(grads)>0):
orig_stream = torch.cuda.current_stream()
with torch.cuda.stream(self.reduction_stream):
self.reduction_stream.wait_stream(orig_stream)
flat_dist_call(grads, dist.all_reduce)
torch.cuda.current_stream().wait_stream(self.reduction_stream)
for param_i, param in enumerate(list(self.module.parameters())):
def wrapper(param_i):
def allreduce_hook(*unused):
if self.needs_refresh:
self.record.append(param_i)
Variable._execution_engine.queue_callback(allreduce_params)
else:
Variable._execution_engine.queue_callback(flush_buckets)
self.comm_ready_buckets(self.record.index(param_i))
if param.requires_grad:
param.register_hook(allreduce_hook)
wrapper(param_i)
def comm_ready_buckets(self, param_ind):
if self.param_state[param_ind] != 0:
raise RuntimeError("Error: Your model uses shared parameters, DDP flag shared_params must be set to True in initialization.")
if self.param_state[self.ready_end] == 0:
self.param_state[param_ind] = 1
return
while self.ready_end < len(self.param_state) and self.param_state[self.ready_end] == 1:
self.ready_params.append(self.param_refs[self.record[self.ready_end]])
self.ready_numel += self.ready_params[-1].numel()
self.ready_end += 1
if self.ready_numel < self.message_size:
self.param_state[param_ind] = 1
return
grads = [param.grad.data for param in self.ready_params]
bucket = []
bucket_inds = []
while grads:
bucket.append(grads.pop(0))
cumm_size = 0
for ten in bucket:
cumm_size += ten.numel()
if cumm_size < self.message_size:
continue
evt = torch.cuda.Event()
evt.record(torch.cuda.current_stream())
evt.wait(stream=self.reduction_stream)
with torch.cuda.stream(self.reduction_stream):
flat_dist_call(bucket, dist.all_reduce)
for i in range(self.ready_start, self.ready_start+len(bucket)):
self.param_state[i] = 2
self.ready_params.pop(0)
self.param_state[param_ind] = 1
def forward(self, *inputs, **kwargs):
param_list = [param for param in list(self.module.parameters()) if param.requires_grad]
#Force needs_refresh to True if there are shared params
#this will force it to always, only call flush_buckets which is safe
#for shared parameters in the model.
#Parentheses are not necessary for correct order of operations, but make the intent clearer.
if (not self.param_refs) or self.shared_param:
self.needs_refresh = True
else:
self.needs_refresh = (
(len(param_list) != len(self.param_refs)) or any(
[param1 is not param2 for param1, param2 in zip(param_list, self.param_refs)]))
if self.needs_refresh:
self.record = []
self.param_state = [0 for i in range(len(param_list))]
self.param_refs = param_list
self.needs_reduction = True
self.ready_start = 0
self.ready_end = 0
self.ready_params = []
self.ready_numel = 0
return self.module(*inputs, **kwargs)

View file

@ -35,7 +35,21 @@ class Fp16Optimizer:
param.data.copy_(new_param.data)
def __init__(self, fp16_model, grad_clip=float('inf'), loss_scale=8192,
dls_downscale=2, dls_upscale=2, dls_upscale_interval=2048):
dls_downscale=2, dls_upscale=2, dls_upscale_interval=128):
"""
Constructor for the Fp16Optimizer.
:param fp16_model: model (previously casted to half)
:param grad_clip: coefficient for gradient clipping, max L2 norm of the
gradients
:param loss_scale: initial loss scale
:param dls_downscale: loss downscale factor, loss scale is divided by
this factor when NaN/INF occurs in the gradients
:param dls_upscale: loss upscale factor, loss scale is multiplied by
this factor if previous dls_upscale_interval batches finished
successfully
:param dls_upscale_interval: interval for loss scale upscaling
"""
logging.info('Initializing fp16 optimizer')
self.initialize_model(fp16_model)
@ -61,7 +75,7 @@ class Fp16Optimizer:
for param in self.fp32_params:
param.requires_grad = True
def step(self, loss, optimizer, update=True):
def step(self, loss, optimizer, scheduler, update=True):
"""
Performs one step of the optimizer.
Applies loss scaling, computes gradients in fp16, converts gradients to
@ -76,21 +90,21 @@ class Fp16Optimizer:
:param update: if True executes weight update
"""
loss *= self.loss_scale
self.fp16_model.zero_grad()
loss.backward()
self.set_grads(self.fp32_params, self.fp16_model.parameters())
if self.loss_scale != 1.0:
for param in self.fp32_params:
param.grad.data /= self.loss_scale
norm = clip_grad_norm_(self.fp32_params, self.grad_clip)
if update:
self.set_grads(self.fp32_params, self.fp16_model.parameters())
if self.loss_scale != 1.0:
for param in self.fp32_params:
param.grad.data /= self.loss_scale
norm = clip_grad_norm_(self.fp32_params, self.grad_clip)
if math.isfinite(norm):
scheduler.step()
optimizer.step()
self.set_weights(self.fp16_model.parameters(), self.fp32_params)
self.set_weights(self.fp16_model.parameters(),
self.fp32_params)
self.since_last_invalid += 1
else:
self.loss_scale /= self.dls_downscale
@ -104,6 +118,8 @@ class Fp16Optimizer:
logging.info(f'Upscaling, new scale: {self.loss_scale}')
self.since_last_invalid = 0
self.fp16_model.zero_grad()
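For orientation, a minimal sketch of the dynamic loss scaling policy described in the constructor docstring (halve the scale after a non-finite gradient norm, grow it again after dls_upscale_interval consecutive good updates); this is a simplification, not the exact control flow of this file:
import math
def update_scale(scale, grad_norm, since_last_invalid,
                 downscale=2, upscale=2, upscale_interval=128):
    # skip the weight update and shrink the scale on overflow
    if not math.isfinite(grad_norm):
        return scale / downscale, 0
    since_last_invalid += 1
    # after enough clean updates, try a larger scale again
    if since_last_invalid >= upscale_interval:
        return scale * upscale, 0
    return scale, since_last_invalid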
class Fp32Optimizer:
"""
@ -114,7 +130,8 @@ class Fp32Optimizer:
Constructor for the Fp32Optimizer
:param model: model
:param grad_clip: max value of gradient norm
:param grad_clip: coefficient for gradient clipping, max L2 norm of the
gradients
"""
logging.info('Initializing fp32 optimizer')
self.initialize_model(model)
@ -129,7 +146,7 @@ class Fp32Optimizer:
self.model = model
self.model.zero_grad()
def step(self, loss, optimizer, update=True):
def step(self, loss, optimizer, scheduler, update=True):
"""
Performs one step of the optimizer.
@ -138,8 +155,9 @@ class Fp32Optimizer:
:param update: if True executes weight update
"""
loss.backward()
if self.grad_clip != float('inf'):
clip_grad_norm_(self.model.parameters(), self.grad_clip)
if update:
if self.grad_clip != float('inf'):
clip_grad_norm_(self.model.parameters(), self.grad_clip)
scheduler.step()
optimizer.step()
self.model.zero_grad()
self.model.zero_grad()

View file

@ -0,0 +1,98 @@
import logging
import math
import torch
def perhaps_convert_float(param, total):
if isinstance(param, float):
param = int(param * total)
return param
class WarmupMultiStepLR(torch.optim.lr_scheduler._LRScheduler):
"""
Learning rate scheduler with exponential warmup and step decay.
"""
def __init__(self, optimizer, iterations, warmup_steps=0,
remain_steps=1.0, decay_interval=None, decay_steps=4,
decay_factor=0.5, last_epoch=-1):
"""
Constructor of WarmupMultiStepLR.
Parameters: warmup_steps, remain_steps and decay_interval accept both
integers and floats as an input. Integer input is interpreted as
absolute index of iteration, float input is interpreted as a fraction
of total training iterations (epochs * steps_per_epoch).
If decay_interval is None then the decay will happen at regularly spaced
intervals ('decay_steps' decays between iteration indices
'remain_steps' and 'iterations').
:param optimizer: instance of optimizer
:param iterations: total number of training iterations
:param warmup_steps: number of warmup iterations
:param remain_steps: start decay at 'remain_steps' iteration
:param decay_interval: interval between LR decay steps
:param decay_steps: max number of decay steps
:param decay_factor: decay factor
:param last_epoch: the index of last iteration
"""
# iterations before learning rate reaches base LR
self.warmup_steps = perhaps_convert_float(warmup_steps, iterations)
logging.info(f'Scheduler warmup steps: {self.warmup_steps}')
# iteration at which decay starts
self.remain_steps = perhaps_convert_float(remain_steps, iterations)
logging.info(f'Scheduler remain steps: {self.remain_steps}')
# number of steps between each decay
if decay_interval is None:
# decay at regularly spaced intervals
decay_iterations = iterations - self.remain_steps
self.decay_interval = decay_iterations // (decay_steps)
self.decay_interval = max(self.decay_interval, 1)
else:
self.decay_interval = perhaps_convert_float(decay_interval,
iterations)
logging.info(f'Scheduler decay interval: {self.decay_interval}')
# multiplicative decay factor
self.decay_factor = decay_factor
logging.info(f'Scheduler decay factor: {self.decay_factor}')
# max number of decay steps
self.decay_steps = decay_steps
logging.info(f'Scheduler max decay steps: {self.decay_steps}')
if self.warmup_steps > self.remain_steps:
logging.warning(f'warmup_steps should not be larger than '
f'remain_steps, setting warmup_steps=remain_steps')
self.warmup_steps = self.remain_steps
super(WarmupMultiStepLR, self).__init__(optimizer, last_epoch)
def get_lr(self):
if self.last_epoch <= self.warmup_steps:
# exponential lr warmup
if self.warmup_steps != 0:
warmup_factor = math.exp(math.log(0.01) / self.warmup_steps)
else:
warmup_factor = 1.0
inv_decay = warmup_factor ** (self.warmup_steps - self.last_epoch)
lr = [base_lr * inv_decay for base_lr in self.base_lrs]
elif self.last_epoch >= self.remain_steps:
# step decay
decay_iter = self.last_epoch - self.remain_steps
num_decay_steps = decay_iter // self.decay_interval + 1
num_decay_steps = min(num_decay_steps, self.decay_steps)
lr = [
base_lr * (self.decay_factor ** num_decay_steps)
for base_lr in self.base_lrs
]
else:
# base lr
lr = [base_lr for base_lr in self.base_lrs]
return lr
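A small usage sketch of the scheduler defined above on a throwaway model (100 total iterations, 10 warmup steps, decay starting at iteration 60); the import path is taken from trainer.py, everything else is illustrative:
import torch
from seq2seq.train.lr_scheduler import WarmupMultiStepLR
model = torch.nn.Linear(4, 4)
opt = torch.optim.SGD(model.parameters(), lr=1.0)
sched = WarmupMultiStepLR(opt, iterations=100, warmup_steps=10,
                          remain_steps=60, decay_steps=4, decay_factor=0.5)
lrs = []
for _ in range(100):
    lrs.append(opt.param_groups[0]['lr'])
    opt.step()
    sched.step()
# lrs ramps exponentially from 0.01 * base_lr to base_lr over the first 10
# steps, holds base_lr until iteration 60, then halves every 10 iterations
# (at most decay_steps=4 times)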

View file

@ -1,6 +1,7 @@
import torch
import torch.nn as nn
class LabelSmoothing(nn.Module):
"""
NLL loss with label smoothing.
@ -18,7 +19,8 @@ class LabelSmoothing(nn.Module):
self.smoothing = smoothing
def forward(self, x, target):
logprobs = torch.nn.functional.log_softmax(x, dim=-1)
logprobs = torch.nn.functional.log_softmax(x, dim=-1,
dtype=torch.float32)
non_pad_mask = (target != self.padding_idx)
nll_loss = -logprobs.gather(dim=-1, index=target.unsqueeze(1))
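The rest of this forward pass is cut off by the hunk boundary; for orientation only, one common formulation of label-smoothed NLL over non-padding tokens looks like the sketch below (it is not guaranteed to match this file line for line):
import torch
import torch.nn.functional as F
def smoothed_nll(logits, target, padding_idx=0, smoothing=0.1):
    logprobs = F.log_softmax(logits, dim=-1, dtype=torch.float32)
    nll = -logprobs.gather(dim=-1, index=target.unsqueeze(1)).squeeze(1)
    smooth = -logprobs.mean(dim=-1)                  # uniform part of the target
    loss = (1.0 - smoothing) * nll + smoothing * smooth
    return loss[target != padding_idx].sum()         # ignore padding positions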

View file

@ -1,15 +1,17 @@
import logging
import time
import os
import time
from itertools import cycle
import numpy as np
import torch
import torch.optim
import torch.utils.data
import numpy as np
from apex.parallel import DistributedDataParallel as DDP
from seq2seq.train.distributed import DistributedDataParallel as DDP
from seq2seq.train.fp_optimizers import Fp16Optimizer, Fp32Optimizer
from seq2seq.train.fp_optimizers import Fp16Optimizer
from seq2seq.train.fp_optimizers import Fp32Optimizer
from seq2seq.train.lr_scheduler import WarmupMultiStepLR
from seq2seq.utils import AverageMeter
from seq2seq.utils import sync_workers
@ -18,21 +20,54 @@ class Seq2SeqTrainer:
"""
Seq2SeqTrainer
"""
def __init__(self, model, criterion, opt_config,
def __init__(self,
model,
criterion,
opt_config,
scheduler_config,
print_freq=10,
save_freq=1000,
grad_clip=float('inf'),
batch_first=False,
save_info={},
save_path='.',
train_iterations=0,
checkpoint_filename='checkpoint%s.pth',
keep_checkpoints=5,
math='fp32',
cuda=True,
distributed=False,
intra_epoch_eval=0,
iter_size=1,
translator=None,
verbose=False):
"""
Constructor for the Seq2SeqTrainer.
:param model: model to train
:param criterion: criterion (loss function)
:param opt_config: dictionary with options for the optimizer
:param scheduler_config: dictionary with options for the learning rate
scheduler
:param print_freq: prints short summary every 'print_freq' iterations
:param save_freq: saves checkpoint every 'save_freq' iterations
:param grad_clip: coefficient for gradient clipping
:param batch_first: if True the model uses (batch,seq,feature) tensors,
if false the model uses (seq, batch, feature)
:param save_info: dict with additional state stored in each checkpoint
:param save_path: path to the directory for checkpoints
:param train_iterations: total number of training iterations to execute
:param checkpoint_filename: name of files with checkpoints
:param keep_checkpoints: max number of checkpoints to keep
:param math: arithmetic type
:param cuda: if True use cuda, if False train on cpu
:param distributed: if True run distributed training
:param intra_epoch_eval: number of additional eval runs within each
training epoch
:param iter_size: number of iterations between weight updates
:param translator: instance of Translator, runs inference on test set
:param verbose: enables verbose logging
"""
super(Seq2SeqTrainer, self).__init__()
self.model = model
self.criterion = criterion
@ -52,16 +87,19 @@ class Seq2SeqTrainer:
self.loss = None
self.translator = translator
self.intra_epoch_eval = intra_epoch_eval
self.iter_size = iter_size
if cuda:
self.model = self.model.cuda()
self.criterion = self.criterion.cuda()
if math == 'fp16':
self.model = self.model.half()
if distributed:
self.model = DDP(self.model)
if math == 'fp16':
self.model = self.model.half()
self.fp_optimizer = Fp16Optimizer(self.model, grad_clip)
params = self.fp_optimizer.fp32_params
elif math == 'fp32':
@ -72,7 +110,18 @@ class Seq2SeqTrainer:
self.optimizer = torch.optim.__dict__[opt_name](params, **opt_config)
logging.info(f'Using optimizer: {self.optimizer}')
self.scheduler = WarmupMultiStepLR(self.optimizer, train_iterations,
**scheduler_config)
def iterate(self, src, tgt, update=True, training=True):
"""
Performs one iteration of the training/validation.
:param src: batch of examples from the source language
:param tgt: batch of examples from the target language
:param update: if True: optimizer does update of the weights
:param training: if True: executes optimizer
"""
src, src_length = src
tgt, tgt_length = tgt
src_length = torch.LongTensor(src_length)
@ -96,14 +145,15 @@ class Seq2SeqTrainer:
tgt_labels = tgt[1:]
T, B = output.size(0), output.size(1)
loss = self.criterion(output.view(T * B, -1).float(),
loss = self.criterion(output.view(T * B, -1),
tgt_labels.contiguous().view(-1))
loss_per_batch = loss.item()
loss /= B
loss /= (B * self.iter_size)
if training:
self.fp_optimizer.step(loss, self.optimizer, update)
self.fp_optimizer.step(loss, self.optimizer, self.scheduler,
update)
loss_per_token = loss_per_batch / num_toks['tgt']
loss_per_sentence = loss_per_batch / B
@ -120,13 +170,15 @@ class Seq2SeqTrainer:
if training:
assert self.optimizer is not None
eval_fractions = np.linspace(0, 1, self.intra_epoch_eval+2)[1:-1]
eval_iters = (eval_fractions * len(data_loader)).astype(int)
iters_with_update = len(data_loader) // self.iter_size
eval_iters = (eval_fractions * iters_with_update).astype(int)
eval_iters = eval_iters * self.iter_size
eval_iters = set(eval_iters)
batch_time = AverageMeter()
data_time = AverageMeter()
losses_per_token = AverageMeter()
losses_per_sentence = AverageMeter()
losses_per_token = AverageMeter(skip_first=False)
losses_per_sentence = AverageMeter(skip_first=False)
tot_tok_time = AverageMeter()
src_tok_time = AverageMeter()
@ -140,8 +192,12 @@ class Seq2SeqTrainer:
# measure data loading time
data_time.update(time.time() - end)
update = False
if i % self.iter_size == self.iter_size - 1:
update = True
# do a train/evaluate iteration
stats = self.iterate(src, tgt, training=training)
stats = self.iterate(src, tgt, update, training=training)
loss_per_token, loss_per_sentence, num_toks = stats
# measure accuracy and record loss
@ -176,13 +232,16 @@ class Seq2SeqTrainer:
log = []
log += [f'{phase} [{self.epoch}][{i}/{len(data_loader)}]']
log += [f'Time {batch_time.val:.3f} ({batch_time.avg:.3f})']
log += [f'Data {data_time.val:.3f} ({data_time.avg:.3f})']
log += [f'Data {data_time.val:.2e} ({data_time.avg:.2e})']
log += [f'Tok/s {tot_tok_time.val:.0f} ({tot_tok_time.avg:.0f})']
if self.verbose:
log += [f'Src tok/s {src_tok_time.val:.0f} ({src_tok_time.avg:.0f})']
log += [f'Tgt tok/s {tgt_tok_time.val:.0f} ({tgt_tok_time.avg:.0f})']
log += [f'Loss/sentence {losses_per_sentence.val:.1f} ({losses_per_sentence.avg:.1f})']
log += [f'Loss/tok {losses_per_token.val:.4f} ({losses_per_token.avg:.4f})']
if training:
lr = self.optimizer.param_groups[0]['lr']
log += [f'LR {lr:.3e}']
log = '\t'.join(log)
logging.info(log)
@ -198,9 +257,8 @@ class Seq2SeqTrainer:
end = time.time()
if training:
tot_tok_time.reduce('sum')
losses_per_token.reduce('mean')
tot_tok_time.reduce('sum')
losses_per_token.reduce('mean')
return losses_per_token.avg, tot_tok_time.avg
@ -228,6 +286,7 @@ class Seq2SeqTrainer:
src = src, src_length
tgt = tgt, tgt_length
self.iterate(src, tgt, update=False, training=training)
self.model.zero_grad()
def optimize(self, data_loader):
"""
@ -241,6 +300,7 @@ class Seq2SeqTrainer:
torch.cuda.empty_cache()
self.preallocate(data_loader, training=True)
output = self.feed_data(data_loader, training=True)
self.model.zero_grad()
torch.cuda.empty_cache()
return output
@ -256,6 +316,7 @@ class Seq2SeqTrainer:
torch.cuda.empty_cache()
self.preallocate(data_loader, training=False)
output = self.feed_data(data_loader, training=False)
self.model.zero_grad()
torch.cuda.empty_cache()
return output
@ -267,9 +328,13 @@ class Seq2SeqTrainer:
"""
if os.path.isfile(filename):
checkpoint = torch.load(filename, map_location={'cuda:0': 'cpu'})
self.model.load_state_dict(checkpoint['state_dict'])
if self.distributed:
self.model.module.load_state_dict(checkpoint['state_dict'])
else:
self.model.load_state_dict(checkpoint['state_dict'])
self.fp_optimizer.initialize_model(self.model)
self.optimizer.load_state_dict(checkpoint['optimizer'])
self.scheduler.load_state_dict(checkpoint['scheduler'])
self.epoch = checkpoint['epoch']
self.loss = checkpoint['loss']
logging.info(f'Loaded checkpoint {filename} (epoch {self.epoch})')
@ -291,10 +356,16 @@ class Seq2SeqTrainer:
logging.info(f'Saving model to {filename}')
torch.save(state, filename)
if self.distributed:
model_state = self.model.module.state_dict()
else:
model_state = self.model.state_dict()
state = {
'epoch': self.epoch,
'state_dict': self.model.state_dict(),
'state_dict': model_state,
'optimizer': self.optimizer.state_dict(),
'scheduler': self.scheduler.state_dict(),
'loss': getattr(self, 'loss', None),
}
state = dict(list(state.items()) + list(self.save_info.items()))

View file

@ -1,9 +1,111 @@
from contextlib import contextmanager
import logging.config
import os
import random
import sys
import time
from contextlib import contextmanager
import numpy as np
import torch
import torch.distributed as dist
import torch.nn.init as init
import torch.utils.collect_env
def init_lstm_(lstm, init_weight=0.1):
"""
Initializes weights of LSTM layer.
Weights and biases are initialized with uniform(-init_weight, init_weight)
distribution.
:param lstm: instance of torch.nn.LSTM
:param init_weight: range for the uniform initializer
"""
# Initialize hidden-hidden weights
init.uniform_(lstm.weight_hh_l0.data, -init_weight, init_weight)
# Initialize input-hidden weights:
init.uniform_(lstm.weight_ih_l0.data, -init_weight, init_weight)
# Initialize bias. PyTorch LSTM has two biases, one for input-hidden GEMM
# and the other for hidden-hidden GEMM. Here input-hidden bias is
# initialized with uniform distribution and hidden-hidden bias is
# initialized with zeros.
init.uniform_(lstm.bias_ih_l0.data, -init_weight, init_weight)
init.zeros_(lstm.bias_hh_l0.data)
if lstm.bidirectional:
init.uniform_(lstm.weight_hh_l0_reverse.data, -init_weight, init_weight)
init.uniform_(lstm.weight_ih_l0_reverse.data, -init_weight, init_weight)
init.uniform_(lstm.bias_ih_l0_reverse.data, -init_weight, init_weight)
init.zeros_(lstm.bias_hh_l0_reverse.data)
def generate_seeds(rng, size):
"""
Generate list of random seeds
:param rng: random number generator
:param size: length of the returned list
"""
seeds = [rng.randint(0, 2**32 - 1) for _ in range(size)]
return seeds
def broadcast_seeds(seeds, device):
"""
Broadcasts random seeds to all distributed workers.
Returns list of random seeds (broadcasted from workers with rank 0).
:param seeds: list of seeds (integers)
:param device: torch.device
"""
if torch.distributed.is_available() and torch.distributed.is_initialized():
seeds_tensor = torch.LongTensor(seeds).to(device)
torch.distributed.broadcast(seeds_tensor, 0)
seeds = seeds_tensor.tolist()
return seeds
def setup_seeds(master_seed, epochs, device):
"""
Generates seeds from one master_seed.
Function returns (worker_seeds, shuffling_seeds), worker_seeds are later
used to initialize per-worker random number generators (mostly for
dropouts), shuffling_seeds are for RNGs responsible for reshuffling the
dataset before each epoch.
Seeds are generated on worker with rank 0 and broadcasted to all other
workers.
:param master_seed: master RNG seed used to initialize other generators
:param epochs: number of epochs
:param device: torch.device (used for distributed.broadcast)
"""
if master_seed is None:
# random master seed, random.SystemRandom() uses /dev/urandom on Unix
master_seed = random.SystemRandom().randint(0, 2**32 - 1)
if get_rank() == 0:
# master seed is reported only from rank=0 worker, it's to avoid
# confusion, seeds from rank=0 are later broadcasted to other
# workers
logging.info(f'Using random master seed: {master_seed}')
else:
# master seed was specified from command line
logging.info(f'Using master seed from command line: {master_seed}')
# initialize seeding RNG
seeding_rng = random.Random(master_seed)
# generate worker seeds, one seed for every distributed worker
worker_seeds = generate_seeds(seeding_rng, get_world_size())
# generate seeds for data shuffling, one seed for every epoch
shuffling_seeds = generate_seeds(seeding_rng, epochs)
# broadcast seeds from rank=0 to other workers
worker_seeds = broadcast_seeds(worker_seeds, device)
shuffling_seeds = broadcast_seeds(shuffling_seeds, device)
return worker_seeds, shuffling_seeds
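A single-process usage sketch of the seeding helpers above (with distributed not initialized, get_rank() is 0 and the broadcast is a no-op); the seed value is arbitrary:
import torch
device = torch.device('cpu')
worker_seeds, shuffling_seeds = setup_seeds(master_seed=2, epochs=8,
                                            device=device)
torch.manual_seed(worker_seeds[get_rank()])   # seed this worker's RNGs
# shuffling_seeds[epoch] is what later feeds the samplers' 'seeds' argument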
def barrier():
@ -12,7 +114,7 @@ def barrier():
doesn't implement barrier for NCCL backend.
Calls all_reduce on dummy tensor and synchronizes with GPU.
"""
if torch.distributed.is_initialized():
if torch.distributed.is_available() and torch.distributed.is_initialized():
torch.distributed.all_reduce(torch.cuda.FloatTensor(1))
torch.cuda.synchronize()
@ -21,7 +123,7 @@ def get_rank():
"""
Gets distributed rank or returns zero if distributed is not initialized.
"""
if torch.distributed.is_initialized():
if torch.distributed.is_available() and torch.distributed.is_initialized():
rank = torch.distributed.get_rank()
else:
rank = 0
@ -33,7 +135,7 @@ def get_world_size():
Gets total number of distributed workers or returns one if distributed is
not initialized.
"""
if torch.distributed.is_initialized():
if torch.distributed.is_available() and torch.distributed.is_initialized():
world_size = torch.distributed.get_world_size()
else:
world_size = 1
@ -50,7 +152,20 @@ def sync_workers():
barrier()
def setup_logging(log_file='log.log'):
@contextmanager
def timer(name, ndigits=2, sync_gpu=True):
if sync_gpu:
torch.cuda.synchronize()
start = time.time()
yield
if sync_gpu:
torch.cuda.synchronize()
stop = time.time()
elapsed = round(stop - start, ndigits)
logging.info(f'TIMER {name} {elapsed}')
def setup_logging(log_file=os.devnull):
"""
Configures logging.
By default logs from all workers are printed to the console, entries are
@ -69,12 +184,13 @@ def setup_logging(log_file='log.log'):
rank = get_rank()
rank_filter = RankFilter(rank)
logging_format = "%(asctime)s - %(levelname)s - %(rank)s - %(message)s"
logging.basicConfig(level=logging.DEBUG,
format="%(asctime)s - %(levelname)s - %(rank)s - %(message)s",
format=logging_format,
datefmt="%Y-%m-%d %H:%M:%S",
filename=log_file,
filemode='w')
console = logging.StreamHandler()
console = logging.StreamHandler(sys.stdout)
console.setLevel(logging.INFO)
formatter = logging.Formatter('%(rank)s: %(message)s')
console.setFormatter(formatter)
@ -82,6 +198,73 @@ def setup_logging(log_file='log.log'):
logging.getLogger('').addFilter(rank_filter)
def set_device(cuda, local_rank):
"""
Sets device based on local_rank and returns instance of torch.device.
:param cuda: if True: use cuda
:param local_rank: local rank of the worker
"""
if cuda:
torch.cuda.set_device(local_rank)
device = torch.device('cuda')
else:
device = torch.device('cpu')
return device
def init_distributed(cuda):
"""
Initializes distributed backend.
:param cuda: (bool) if True initializes nccl backend, if False initializes
gloo backend
"""
world_size = int(os.environ.get('WORLD_SIZE', 1))
distributed = (world_size > 1)
if distributed:
backend = 'nccl' if cuda else 'gloo'
dist.init_process_group(backend=backend,
init_method='env://')
assert dist.is_initialized()
return distributed
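A minimal sketch of how set_device and init_distributed are meant to be combined (this mirrors the calls added to train.py and translate.py below); WORLD_SIZE, RANK, MASTER_ADDR and MASTER_PORT are assumed to be exported by the launcher, e.g. torch.distributed.launch, which also supplies --local_rank:

device = set_device(cuda=True, local_rank=args.local_rank)  # --local_rank comes from the launcher
distributed = init_distributed(cuda=True)                   # nccl backend if cuda, gloo otherwise
if distributed:
    barrier()   # wait until every worker reaches this point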
def log_env_info():
"""
Prints information about execution environment.
"""
logging.info('Collecting environment information...')
env_info = torch.utils.collect_env.get_pretty_env_info()
logging.info(f'{env_info}')
def pad_vocabulary(math):
if math == 'fp16':
pad_vocab = 8
elif math == 'fp32':
pad_vocab = 1
return pad_vocab
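pad_vocabulary only picks the padding multiple (8 for fp16, so embedding and classifier GEMM dimensions stay Tensor Core friendly, 1 for fp32); the actual rounding happens inside Tokenizer, which is not shown in this hunk. An illustrative round-up, assuming an example 31794-entry vocabulary:

pad = pad_vocabulary('fp16')                             # -> 8
vocab_size = 31794                                       # example value only
padded_vocab_size = (vocab_size + pad - 1) // pad * pad  # -> 31800
assert padded_vocab_size % pad == 0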
def benchmark(test_acc, target_acc, test_perf, target_perf):
def test(achieved, target, name):
passed = True
if target is not None and achieved is not None:
logging.info(f'{name} achieved: {achieved:.2f} '
f'target: {target:.2f}')
if achieved >= target:
logging.info(f'{name} test passed')
else:
logging.info(f'{name} test failed')
passed = False
return passed
passed = True
passed &= test(test_acc, target_acc, 'Accuracy')
passed &= test(test_perf, target_perf, 'Performance')
return passed
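A hedged example of the benchmark helper above, matching how main() in train.py (further down in this diff) turns the result into the process exit code; a test is skipped, and counted as passed, when either the target or the achieved value is None:

passed = benchmark(test_acc=24.1, target_acc=24.0,    # accuracy test passes
                   test_perf=None, target_perf=None)  # performance test skipped
if not passed:
    sys.exit(1)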
class AverageMeter:
"""
Computes and stores the average and current value
@ -117,18 +300,42 @@ class AverageMeter:
distributed = (get_world_size() > 1)
if distributed:
cuda = (dist._backend == dist.dist_backend.NCCL)
# Backward/forward compatibility around
# https://github.com/pytorch/pytorch/commit/540ef9b1fc5506369a48491af8a285a686689b36 and
# https://github.com/pytorch/pytorch/commit/044d00516ccd6572c0d6ab6d54587155b02a3b86
# To accommodate changes in PyTorch's distributed API
if hasattr(dist, "get_backend"):
_backend = dist.get_backend()
if hasattr(dist, "DistBackend"):
backend_enum_holder = dist.DistBackend
else:
backend_enum_holder = dist.Backend
else:
_backend = dist._backend
backend_enum_holder = dist.dist_backend
cuda = _backend == backend_enum_holder.NCCL
if cuda:
avg = torch.cuda.FloatTensor([self.avg])
_sum = torch.cuda.FloatTensor([self.sum])
else:
avg = torch.FloatTensor([self.avg])
_sum = torch.FloatTensor([self.sum])
dist.all_reduce(avg, op=dist.reduce_op.SUM)
try:
_reduce_op = dist.ReduceOp
except AttributeError:
_reduce_op = dist.reduce_op
dist.all_reduce(avg, op=_reduce_op.SUM)
dist.all_reduce(_sum, op=_reduce_op.SUM)
self.avg = avg.item()
self.sum = _sum.item()
if op == 'mean':
self.avg /= get_world_size()
self.sum /= get_world_size()
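For context, a sketch of how AverageMeter is typically driven; the constructor and update() signatures are not visible in this hunk and are assumed to follow the usual update(value, n) pattern, while reduce('sum'/'mean') is the method patched above:

losses = AverageMeter()                  # assumed default constructor
for loss, batch_size in [(2.3, 128), (2.1, 128)]:
    losses.update(loss, batch_size)      # assumed signature
losses.reduce('mean')                    # all-reduce avg/sum across workers
logging.info(f'average loss: {losses.avg:.4f}')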
def debug_tensor(tensor, name):

417
train.py
@ -1,130 +1,187 @@
#!/usr/bin/env python
import argparse
import os
import logging
import os
import sys
from ast import literal_eval
import torch.nn as nn
import torch.nn.parallel
import torch.utils.data.distributed
import torch.distributed as dist
import torch.optim
import torch.utils.data.distributed
from seq2seq.models.gnmt import GNMT
from seq2seq.train.smoothing import LabelSmoothing
from seq2seq.data.dataset import TextDataset
from seq2seq.data.dataset import ParallelDataset
from seq2seq.data.tokenizer import Tokenizer
from seq2seq.utils import setup_logging
import seq2seq.data.config as config
import seq2seq.train.trainer as trainers
import seq2seq.utils as utils
from seq2seq.data.dataset import LazyParallelDataset
from seq2seq.data.dataset import ParallelDataset
from seq2seq.data.dataset import TextDataset
from seq2seq.data.tokenizer import Tokenizer
from seq2seq.inference.inference import Translator
from seq2seq.models.gnmt import GNMT
from seq2seq.train.smoothing import LabelSmoothing
def parse_args():
"""
Parse commandline arguments.
"""
parser = argparse.ArgumentParser(description='GNMT training',
formatter_class=argparse.ArgumentDefaultsHelpFormatter)
def exclusive_group(group, name, default, help):
destname = name.replace('-', '_')
subgroup = group.add_mutually_exclusive_group(required=False)
subgroup.add_argument(f'--{name}', dest=f'{destname}',
action='store_true',
help=f'{help} (use \'--no-{name}\' to disable)')
subgroup.add_argument(f'--no-{name}', dest=f'{destname}',
action='store_false', help=argparse.SUPPRESS)
subgroup.set_defaults(**{destname: default})
parser = argparse.ArgumentParser(
description='GNMT training',
formatter_class=argparse.ArgumentDefaultsHelpFormatter)
# dataset
dataset = parser.add_argument_group('dataset setup')
dataset.add_argument('--dataset-dir', default=None, required=True,
help='path to directory with training/validation data')
dataset.add_argument('--dataset-dir', default='data/wmt16_de_en',
help='path to the directory with training/test data')
dataset.add_argument('--max-size', default=None, type=int,
help='use at most MAX_SIZE elements from training \
dataset (useful for benchmarking), by default \
uses entire dataset')
dataset (useful for benchmarking), by default \
uses entire dataset')
# results
results = parser.add_argument_group('results setup')
results.add_argument('--results-dir', default='results',
help='path to directory with results, it it will be \
automatically created if does not exist')
results.add_argument('--save', default='gnmt_wmt16',
help='path to directory with results, it will be \
automatically created if it does not exist')
results.add_argument('--save', default='gnmt',
help='defines subdirectory within RESULTS_DIR for \
results from this training run')
results from this training run')
results.add_argument('--print-freq', default=10, type=int,
help='print log every PRINT_FREQ batches')
# model
model = parser.add_argument_group('model setup')
model.add_argument('--model-config',
default="{'hidden_size': 1024,'num_layers': 4, \
'dropout': 0.2, 'share_embedding': True}",
help='GNMT architecture configuration')
model.add_argument('--hidden-size', default=1024, type=int,
help='model hidden size')
model.add_argument('--num-layers', default=4, type=int,
help='number of RNN layers in encoder and in decoder')
model.add_argument('--dropout', default=0.2, type=float,
help='dropout applied to input of RNN cells')
exclusive_group(group=model, name='share-embedding', default=True,
help='use shared embeddings for encoder and decoder')
model.add_argument('--smoothing', default=0.1, type=float,
help='label smoothing, if equal to zero model will use \
CrossEntropyLoss, if not zero model will be trained \
with label smoothing loss')
CrossEntropyLoss, if not zero model will be trained \
with label smoothing loss')
# setup
general = parser.add_argument_group('general setup')
general.add_argument('--math', default='fp16', choices=['fp32', 'fp16'],
general.add_argument('--math', default='fp16', choices=['fp16', 'fp32'],
help='arithmetic type')
general.add_argument('--seed', default=None, type=int,
help='set random number generator seed')
general.add_argument('--disable-eval', action='store_true', default=False,
help='disables validation after every epoch')
general.add_argument('--workers', default=0, type=int,
help='number of workers for data loading')
help='master seed for random number generators, if \
"seed" is undefined then the master seed will be \
sampled from random.SystemRandom()')
cuda_parser = general.add_mutually_exclusive_group(required=False)
cuda_parser.add_argument('--cuda', dest='cuda', action='store_true',
help='enables cuda (use \'--no-cuda\' to disable)')
cuda_parser.add_argument('--no-cuda', dest='cuda', action='store_false',
help=argparse.SUPPRESS)
cuda_parser.set_defaults(cuda=True)
cudnn_parser = general.add_mutually_exclusive_group(required=False)
cudnn_parser.add_argument('--cudnn', dest='cudnn', action='store_true',
help='enables cudnn (use \'--no-cudnn\' to disable)')
cudnn_parser.add_argument('--no-cudnn', dest='cudnn', action='store_false',
help=argparse.SUPPRESS)
cudnn_parser.set_defaults(cudnn=True)
exclusive_group(group=general, name='eval', default=True,
help='run validation and test after every epoch')
exclusive_group(group=general, name='env', default=True,
help='print info about execution env')
exclusive_group(group=general, name='cuda', default=True,
help='enables cuda')
exclusive_group(group=general, name='cudnn', default=True,
help='enables cudnn')
# training
training = parser.add_argument_group('training setup')
training.add_argument('--batch-size', default=128, type=int,
help='batch size for training')
training.add_argument('--epochs', default=8, type=int,
help='number of total epochs to run')
training.add_argument('--optimization-config',
default="{'optimizer': 'Adam', 'lr': 5e-4}", type=str,
help='optimizer config')
training.add_argument('--grad-clip', default=5.0, type=float,
help='enabled gradient clipping and sets maximum \
gradient norm value')
training.add_argument('--max-length-train', default=50, type=int,
help='maximum sequence length for training')
training.add_argument('--min-length-train', default=0, type=int,
help='minimum sequence length for training')
training.add_argument('--train-batch-size', default=128, type=int,
help='training batch size per worker')
training.add_argument('--train-global-batch-size', default=None, type=int,
help='global training batch size, optional; if \
defined, it will be used to automatically compute \
train_iter_size with the equation: \
train_iter_size = train_global_batch_size // \
(train_batch_size * world_size)')
training.add_argument('--train-iter-size', metavar='N', default=1,
type=int,
help='training iter size, training loop will \
accumulate gradients over N iterations and execute \
optimizer every N steps')
training.add_argument('--epochs', default=6, type=int,
help='max number of training epochs')
bucketing_parser = training.add_mutually_exclusive_group(required=False)
bucketing_parser.add_argument('--bucketing', dest='bucketing', action='store_true',
help='enables bucketing (use \'--no-bucketing\' to disable)')
bucketing_parser.add_argument('--no-bucketing', dest='bucketing', action='store_false',
help=argparse.SUPPRESS)
bucketing_parser.set_defaults(bucketing=True)
training.add_argument('--grad-clip', default=5.0, type=float,
help='enables gradient clipping and sets maximum \
norm of gradients')
training.add_argument('--max-length-train', default=50, type=int,
help='maximum sequence length for training \
(including special BOS and EOS tokens)')
training.add_argument('--min-length-train', default=0, type=int,
help='minimum sequence length for training \
(including special BOS and EOS tokens)')
training.add_argument('--train-loader-workers', default=2, type=int,
help='number of workers for training data loading')
training.add_argument('--batching', default='bucketing', type=str,
choices=['random', 'sharding', 'bucketing'],
help='select batching algorithm')
training.add_argument('--shard-size', default=80, type=int,
help='shard size for "sharding" batching algorithm, \
in multiples of global batch size')
training.add_argument('--num-buckets', default=5, type=int,
help='number of buckets for "bucketing" batching \
algorithm')
# optimizer
optimizer = parser.add_argument_group('optimizer setup')
optimizer.add_argument('--optimizer', type=str, default='Adam',
help='training optimizer')
optimizer.add_argument('--lr', type=float, default=2.00e-3,
help='learning rate')
optimizer.add_argument('--optimizer-extra', type=str,
default="{}",
help='extra options for the optimizer')
# scheduler
scheduler = parser.add_argument_group('learning rate scheduler setup')
scheduler.add_argument('--warmup-steps', type=str, default='200',
help='number of learning rate warmup iterations')
scheduler.add_argument('--remain-steps', type=str, default='0.666',
help='starting iteration for learning rate decay')
scheduler.add_argument('--decay-interval', type=str, default='None',
help='interval between learning rate decay steps')
scheduler.add_argument('--decay-steps', type=int, default=4,
help='max number of learning rate decay steps')
scheduler.add_argument('--decay-factor', type=float, default=0.5,
help='learning rate decay factor')
# validation
validation = parser.add_argument_group('validation setup')
validation.add_argument('--val-batch-size', default=128, type=int,
help='batch size for validation')
validation.add_argument('--max-length-val', default=80, type=int,
help='maximum sequence length for validation')
validation.add_argument('--min-length-val', default=0, type=int,
help='minimum sequence length for validation')
val = parser.add_argument_group('validation setup')
val.add_argument('--val-batch-size', default=64, type=int,
help='batch size for validation')
val.add_argument('--max-length-val', default=125, type=int,
help='maximum sequence length for validation \
(including special BOS and EOS tokens)')
val.add_argument('--min-length-val', default=0, type=int,
help='minimum sequence length for validation \
(including special BOS and EOS tokens)')
val.add_argument('--val-loader-workers', default=0, type=int,
help='number of workers for validation data loading')
# test
test = parser.add_argument_group('test setup')
test.add_argument('--test-batch-size', default=128, type=int,
help='batch size for test')
test.add_argument('--max-length-test', default=150, type=int,
help='maximum sequence length for test')
help='maximum sequence length for test \
(including special BOS and EOS tokens)')
test.add_argument('--min-length-test', default=0, type=int,
help='minimum sequence length for test')
help='minimum sequence length for test \
(including special BOS and EOS tokens)')
test.add_argument('--beam-size', default=5, type=int,
help='beam size')
test.add_argument('--len-norm-factor', default=0.6, type=float,
@ -133,36 +190,50 @@ def parse_args():
help='coverage penalty factor')
test.add_argument('--len-norm-const', default=5.0, type=float,
help='length normalization constant')
test.add_argument('--target-bleu', default=None, type=float,
help='target accuracy')
test.add_argument('--intra-epoch-eval', default=0, type=int,
help='evaluate within epoch')
test.add_argument('--intra-epoch-eval', metavar='N', default=0, type=int,
help='evaluate within training epoch, this option will \
enable extra N equally spaced evaluations executed \
during each training epoch')
test.add_argument('--test-loader-workers', default=0, type=int,
help='number of workers for test data loading')
# checkpointing
checkpoint = parser.add_argument_group('checkpointing setup')
checkpoint.add_argument('--start-epoch', default=0, type=int,
help='manually set initial epoch counter')
checkpoint.add_argument('--resume', default=None, type=str, metavar='PATH',
help='resumes training from checkpoint from PATH')
checkpoint.add_argument('--save-all', action='store_true', default=False,
help='saves checkpoint after every epoch')
checkpoint.add_argument('--save-freq', default=5000, type=int,
help='save checkpoint every SAVE_FREQ batches')
checkpoint.add_argument('--keep-checkpoints', default=0, type=int,
help='keep only last KEEP_CHECKPOINTS checkpoints, \
affects only checkpoints controlled by --save-freq \
option')
chkpt = parser.add_argument_group('checkpointing setup')
chkpt.add_argument('--start-epoch', default=0, type=int,
help='manually set initial epoch counter')
chkpt.add_argument('--resume', default=None, type=str, metavar='PATH',
help='resumes training from checkpoint from PATH')
chkpt.add_argument('--save-all', action='store_true', default=False,
help='saves checkpoint after every epoch')
chkpt.add_argument('--save-freq', default=5000, type=int,
help='save checkpoint every SAVE_FREQ batches')
chkpt.add_argument('--keep-checkpoints', default=0, type=int,
help='keep only last KEEP_CHECKPOINTS checkpoints, \
affects only checkpoints controlled by --save-freq \
option')
# distributed support
# benchmarking
benchmark = parser.add_argument_group('benchmark setup')
benchmark.add_argument('--target-perf', default=None, type=float,
help='target training performance (in tokens \
per second)')
benchmark.add_argument('--target-bleu', default=None, type=float,
help='target accuracy')
# distributed
distributed = parser.add_argument_group('distributed setup')
distributed.add_argument('--rank', default=0, type=int,
help='rank of the process, do not set! Done by multiproc module')
distributed.add_argument('--world-size', default=1, type=int,
help='number of processes, do not set! Done by multiproc module')
distributed.add_argument('--dist-url', default='tcp://localhost:23456', type=str,
help='url used to set up distributed training')
help='global rank of the process, do not set!')
distributed.add_argument('--local_rank', default=0, type=int,
help='local rank of the process, do not set!')
return parser.parse_args()
args = parser.parse_args()
args.warmup_steps = literal_eval(args.warmup_steps)
args.remain_steps = literal_eval(args.remain_steps)
args.decay_interval = literal_eval(args.decay_interval)
return args
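As a usage note on the parser above: every exclusive_group() call produces a --<name>/--no-<name> pair that lands in a single boolean attribute, and the scheduler options are parsed as strings and converted with literal_eval so they can carry ints, floats, or None. A sketch, assuming the defaults shown above:

import sys
sys.argv = ['train.py', '--no-share-embedding', '--math', 'fp32',
            '--remain-steps', '0.666', '--decay-interval', 'None']
args = parse_args()
assert args.share_embedding is False   # flipped by the --no-* variant
assert args.remain_steps == 0.666      # '0.666' -> float via literal_eval
assert args.decay_interval is None     # 'None'  -> None via literal_eval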
def build_criterion(vocab_size, padding_idx, smoothing):
@ -178,24 +249,18 @@ def build_criterion(vocab_size, padding_idx, smoothing):
return criterion
@utils.timer('TOTAL RUNTIME', sync_gpu=False)
def main():
"""
Launches data-parallel multi-gpu training.
"""
args = parse_args()
device = utils.set_device(args.cuda, args.local_rank)
distributed = utils.init_distributed(args.cuda)
args.rank = utils.get_rank()
if not args.cudnn:
torch.backends.cudnn.enabled = False
if args.seed is not None:
torch.manual_seed(args.seed + args.rank)
# initialize distributed backend
distributed = args.world_size > 1
if distributed:
backend = 'nccl' if args.cuda else 'gloo'
dist.init_process_group(backend=backend, rank=args.rank,
init_method=args.dist_url,
world_size=args.world_size)
# create directory for results
save_path = os.path.join(args.results_dir, args.save)
@ -203,20 +268,39 @@ def main():
os.makedirs(save_path, exist_ok=True)
# setup logging
log_filename = f'log_gpu_{args.rank}.log'
setup_logging(os.path.join(save_path, log_filename))
log_filename = f'log_rank_{utils.get_rank()}.log'
utils.setup_logging(os.path.join(save_path, log_filename))
if args.env:
utils.log_env_info()
logging.info(f'Saving results to: {save_path}')
logging.info(f'Run arguments: {args}')
if args.cuda:
torch.cuda.set_device(args.rank)
# automatically set train_iter_size based on train_global_batch_size,
# world_size and per-worker train_batch_size
if args.train_global_batch_size is not None:
global_bs = args.train_global_batch_size
bs = args.train_batch_size
world_size = utils.get_world_size()
assert global_bs % (bs * world_size) == 0
args.train_iter_size = global_bs // (bs * world_size)
logging.info(f'Global batch size was set in the config, '
f'setting train_iter_size to {args.train_iter_size}')
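# Worked example (assumed values): train_global_batch_size=1024,
# train_batch_size=128, world_size=4 -> train_iter_size = 1024 // (128 * 4) = 2,
# so gradients are accumulated over 2 iterations per optimizer step.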
worker_seeds, shuffling_seeds = utils.setup_seeds(args.seed, args.epochs,
device)
worker_seed = worker_seeds[args.rank]
logging.info(f'Worker {args.rank} is using worker seed: {worker_seed}')
torch.manual_seed(worker_seed)
# build tokenizer
tokenizer = Tokenizer(os.path.join(args.dataset_dir, config.VOCAB_FNAME))
pad_vocab = utils.pad_vocabulary(args.math)
tokenizer = Tokenizer(os.path.join(args.dataset_dir, config.VOCAB_FNAME),
pad_vocab)
# build datasets
train_data = ParallelDataset(
train_data = LazyParallelDataset(
src_fname=os.path.join(args.dataset_dir, config.SRC_TRAIN_FNAME),
tgt_fname=os.path.join(args.dataset_dir, config.TGT_TRAIN_FNAME),
tokenizer=tokenizer,
@ -238,45 +322,59 @@ def main():
tokenizer=tokenizer,
min_len=args.min_length_test,
max_len=args.max_length_test,
sort=False)
sort=True)
vocab_size = tokenizer.vocab_size
# build GNMT model
model_config = dict(vocab_size=vocab_size, math=args.math,
**literal_eval(args.model_config))
model = GNMT(**model_config)
model_config = {'hidden_size': args.hidden_size,
'num_layers': args.num_layers,
'dropout': args.dropout, 'batch_first': False,
'share_embedding': args.share_embedding}
model = GNMT(vocab_size=vocab_size, **model_config)
logging.info(model)
batch_first = model.batch_first
# define loss function (criterion) and optimizer
criterion = build_criterion(vocab_size, config.PAD, args.smoothing)
opt_config = literal_eval(args.optimization_config)
logging.info(f'Training optimizer: {opt_config}')
opt_config = {'optimizer': args.optimizer, 'lr': args.lr}
opt_config.update(literal_eval(args.optimizer_extra))
logging.info(f'Training optimizer config: {opt_config}')
scheduler_config = {'warmup_steps': args.warmup_steps,
'remain_steps': args.remain_steps,
'decay_interval': args.decay_interval,
'decay_steps': args.decay_steps,
'decay_factor': args.decay_factor}
logging.info(f'Training LR schedule config: {scheduler_config}')
num_parameters = sum([l.nelement() for l in model.parameters()])
logging.info(f'Number of parameters: {num_parameters}')
batching_opt = {'shard_size': args.shard_size,
'num_buckets': args.num_buckets}
# get data loaders
train_loader = train_data.get_loader(batch_size=args.batch_size,
train_loader = train_data.get_loader(batch_size=args.train_batch_size,
seeds=shuffling_seeds,
batch_first=batch_first,
shuffle=True,
bucketing=args.bucketing,
num_workers=args.workers,
drop_last=True)
batching=args.batching,
batching_opt=batching_opt,
num_workers=args.train_loader_workers)
val_loader = val_data.get_loader(batch_size=args.val_batch_size,
batch_first=batch_first,
shuffle=False,
num_workers=args.workers,
drop_last=False)
num_workers=args.val_loader_workers)
test_loader = test_data.get_loader(batch_size=args.test_batch_size,
batch_first=batch_first,
shuffle=False,
num_workers=args.workers,
drop_last=False)
pad=True,
num_workers=args.test_loader_workers)
translator = Translator(model=model,
tokenizer=tokenizer,
@ -293,13 +391,19 @@ def main():
save_path=args.save_path)
# create trainer
total_train_iters = len(train_loader) // args.train_iter_size * args.epochs
save_info = {'model_config': model_config, 'config': args, 'tokenizer':
tokenizer.get_state()}
trainer_options = dict(
criterion=criterion,
grad_clip=args.grad_clip,
iter_size=args.train_iter_size,
save_path=save_path,
save_freq=args.save_freq,
save_info={'config': args, 'tokenizer': tokenizer},
save_info=save_info,
opt_config=opt_config,
scheduler_config=scheduler_config,
train_iterations=total_train_iters,
batch_first=batch_first,
keep_checkpoints=args.keep_checkpoints,
math=args.math,
@ -325,47 +429,60 @@ def main():
# training loop
best_loss = float('inf')
break_training = False
test_bleu = None
for epoch in range(args.start_epoch, args.epochs):
logging.info(f'Starting epoch {epoch}')
if distributed:
train_loader.sampler.set_epoch(epoch)
train_loader.sampler.set_epoch(epoch)
trainer.epoch = epoch
train_loss, train_perf = trainer.optimize(train_loader)
# evaluate on validation set
if args.rank == 0 and not args.disable_eval:
if args.eval:
logging.info(f'Running validation on dev set')
val_loss, val_perf = trainer.evaluate(val_loader)
# remember best prec@1 and save checkpoint
is_best = val_loss < best_loss
best_loss = min(val_loss, best_loss)
trainer.save(save_all=args.save_all, is_best=is_best)
if args.rank == 0:
is_best = val_loss < best_loss
best_loss = min(val_loss, best_loss)
trainer.save(save_all=args.save_all, is_best=is_best)
break_training = False
if not args.disable_eval:
if args.eval:
utils.barrier()
test_bleu, break_training = translator.run(calc_bleu=True,
epoch=epoch)
if args.rank == 0 and not args.disable_eval:
logging.info(f'Summary: Epoch: {epoch}\t'
f'Training Loss: {train_loss:.4f}\t'
f'Validation Loss: {val_loss:.4f}\t'
f'Test BLEU: {test_bleu:.2f}')
logging.info(f'Performance: Epoch: {epoch}\t'
f'Training: {train_perf:.0f} Tok/s\t'
f'Validation: {val_perf:.0f} Tok/s')
else:
logging.info(f'Summary: Epoch: {epoch}\t'
f'Training Loss {train_loss:.4f}')
logging.info(f'Performance: Epoch: {epoch}\t'
f'Training: {train_perf:.0f} Tok/s')
acc_log = []
acc_log += [f'Summary: Epoch: {epoch}']
acc_log += [f'Training Loss: {train_loss:.4f}']
if args.eval:
acc_log += [f'Validation Loss: {val_loss:.4f}']
acc_log += [f'Test BLEU: {test_bleu:.2f}']
perf_log = []
perf_log += [f'Performance: Epoch: {epoch}']
perf_log += [f'Training: {train_perf:.0f} Tok/s']
if args.eval:
perf_log += [f'Validation: {val_perf:.0f} Tok/s']
if args.rank == 0:
logging.info('\t'.join(acc_log))
logging.info('\t'.join(perf_log))
logging.info(f'Finished epoch {epoch}')
if break_training:
break
utils.barrier()
passed = utils.benchmark(test_bleu, args.target_bleu,
train_perf, args.target_perf)
return passed
if __name__ == '__main__':
main()
passed = main()
if not passed:
sys.exit(1)


@ -1,39 +1,61 @@
#!/usr/bin/env python
import logging
import argparse
import logging
import os
import warnings
from ast import literal_eval
from itertools import product
import torch
import torch.distributed as dist
from seq2seq.models.gnmt import GNMT
from seq2seq.inference.inference import Translator
import seq2seq.utils as utils
from seq2seq.data.dataset import TextDataset
from seq2seq.data.tokenizer import Tokenizer
from seq2seq.inference.inference import Translator
from seq2seq.models.gnmt import GNMT
from seq2seq.utils import setup_logging
def parse_args():
"""
Parse commandline arguments.
"""
parser = argparse.ArgumentParser(description='GNMT Translate',
formatter_class=argparse.ArgumentDefaultsHelpFormatter)
# data
def exclusive_group(group, name, default, help):
destname = name.replace('-', '_')
subgroup = group.add_mutually_exclusive_group(required=False)
subgroup.add_argument(f'--{name}', dest=f'{destname}',
action='store_true',
help=f'{help} (use \'--no-{name}\' to disable)')
subgroup.add_argument(f'--no-{name}', dest=f'{destname}',
action='store_false', help=argparse.SUPPRESS)
subgroup.set_defaults(**{destname: default})
parser = argparse.ArgumentParser(
description='GNMT Translate',
formatter_class=argparse.ArgumentDefaultsHelpFormatter)
# dataset
dataset = parser.add_argument_group('data setup')
dataset.add_argument('--dataset-dir', default='data/wmt16_de_en/',
help='path to directory with training/validation data')
help='path to directory with training/test data')
dataset.add_argument('-i', '--input', required=True,
help='full path to the input file (tokenized)')
dataset.add_argument('-o', '--output', required=True,
help='full path to the output file (tokenized)')
dataset.add_argument('-r', '--reference', default=None,
help='full path to the reference file (for sacrebleu)')
help='full path to the file with reference \
translations (for sacrebleu)')
dataset.add_argument('-m', '--model', required=True,
help='full path to the model checkpoint file')
exclusive_group(group=dataset, name='sort', default=True,
help='sorts dataset by sequence length')
# parameters
params = parser.add_argument_group('inference setup')
params.add_argument('--batch-size', default=128, type=int,
help='batch size')
params.add_argument('--beam-size', default=5, type=int,
params.add_argument('--batch-size', nargs='+', default=[128], type=int,
help='batch size per GPU')
params.add_argument('--beam-size', nargs='+', default=[5], type=int,
help='beam size')
params.add_argument('--max-seq-len', default=80, type=int,
help='maximum generated sequence length')
@ -45,16 +67,18 @@ def parse_args():
help='length normalization constant')
# general setup
general = parser.add_argument_group('general setup')
general.add_argument('--math', default='fp16', choices=['fp32', 'fp16'],
help='arithmetic type')
general.add_argument('--math', nargs='+', default=['fp16'],
choices=['fp16', 'fp32'], help='arithmetic type')
bleu_parser = general.add_mutually_exclusive_group(required=False)
bleu_parser.add_argument('--bleu', dest='bleu', action='store_true',
help='compares with reference and computes BLEU \
(use \'--no-bleu\' to disable)')
bleu_parser.add_argument('--no-bleu', dest='bleu', action='store_false',
help=argparse.SUPPRESS)
bleu_parser.set_defaults(bleu=True)
exclusive_group(group=general, name='env', default=True,
help='print info about execution env')
exclusive_group(group=general, name='bleu', default=True,
help='compares with reference translation and computes \
BLEU')
exclusive_group(group=general, name='cuda', default=True,
help='enables cuda')
exclusive_group(group=general, name='cudnn', default=True,
help='enables cudnn')
batch_first_parser = general.add_mutually_exclusive_group(required=False)
batch_first_parser.add_argument('--batch-first', dest='batch_first',
@ -67,62 +91,27 @@ def parse_args():
format for RNNs')
batch_first_parser.set_defaults(batch_first=True)
cuda_parser = general.add_mutually_exclusive_group(required=False)
cuda_parser.add_argument('--cuda', dest='cuda', action='store_true',
help='enables cuda (use \'--no-cuda\' to disable)')
cuda_parser.add_argument('--no-cuda', dest='cuda', action='store_false',
help=argparse.SUPPRESS)
cuda_parser.set_defaults(cuda=True)
cudnn_parser = general.add_mutually_exclusive_group(required=False)
cudnn_parser.add_argument('--cudnn', dest='cudnn', action='store_true',
help='enables cudnn (use \'--no-cudnn\' to disable)')
cudnn_parser.add_argument('--no-cudnn', dest='cudnn', action='store_false',
help=argparse.SUPPRESS)
cudnn_parser.set_defaults(cudnn=True)
general.add_argument('--print-freq', '-p', default=1, type=int,
help='print log every PRINT_FREQ batches')
# distributed
distributed = parser.add_argument_group('distributed setup')
distributed.add_argument('--rank', default=0, type=int,
help='global rank of the process, do not set!')
distributed.add_argument('--local_rank', default=0, type=int,
help='local rank of the process, do not set!')
args = parser.parse_args()
if args.bleu and args.reference is None:
parser.error('--bleu requires --reference')
if 'fp16' in args.math and not args.cuda:
parser.error('--math fp16 requires --cuda')
return args
def checkpoint_from_distributed(state_dict):
"""
Checks whether the checkpoint was generated by DistributedDataParallel. DDP
wraps the model in an additional "module." prefix, which has to be removed
for single-GPU inference.
:param state_dict: model's state dict
"""
ret = False
for key, _ in state_dict.items():
if key.find('module.') != -1:
ret = True
break
return ret
def unwrap_distributed(state_dict):
"""
Unwraps the model from DistributedDataParallel.
DDP wraps the model in an additional "module." prefix, which has to be
removed for single-GPU inference.
:param state_dict: model's state dict
"""
new_state_dict = {}
for key, value in state_dict.items():
new_key = key.replace('module.', '')
new_state_dict[new_key] = value
return new_state_dict
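A short sketch of the intended use of the two helpers above (main() below does the same): strip the "module." prefix before loading weights saved from a DDP-wrapped model into a plain GNMT instance; the checkpoint path and the model object are illustrative:

checkpoint = torch.load('checkpoint.pth', map_location='cpu')  # illustrative path
state_dict = checkpoint['state_dict']
if checkpoint_from_distributed(state_dict):
    state_dict = unwrap_distributed(state_dict)
model.load_state_dict(state_dict)   # 'model' is a plain (non-DDP) GNMT instance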
def main():
"""
Launches translation (inference).
@ -130,26 +119,17 @@ def main():
with length normalization and coverage penalty.
"""
args = parse_args()
utils.set_device(args.cuda, args.local_rank)
utils.init_distributed(args.cuda)
setup_logging()
logging.basicConfig(level=logging.DEBUG,
format="%(asctime)s - %(levelname)s - %(message)s",
datefmt="%Y-%m-%d %H:%M:%S",
filename='log.log',
filemode='w')
console = logging.StreamHandler()
console.setLevel(logging.INFO)
formatter = logging.Formatter('%(message)s')
console.setFormatter(formatter)
logging.getLogger('').addHandler(console)
if args.env:
utils.log_env_info()
logging.info(args)
logging.info(f'Run arguments: {args}')
if args.cuda:
torch.cuda.set_device(0)
if not args.cuda and torch.cuda.is_available():
warnings.warn('cuda is available but not enabled')
if args.math == 'fp16' and not args.cuda:
raise RuntimeError('fp16 requires cuda')
if not args.cudnn:
torch.backends.cudnn.enabled = False
@ -157,57 +137,57 @@ def main():
checkpoint = torch.load(args.model, map_location={'cuda:0': 'cpu'})
# build GNMT model
tokenizer = checkpoint['tokenizer']
tokenizer = Tokenizer()
tokenizer.set_state(checkpoint['tokenizer'])
vocab_size = tokenizer.vocab_size
model_config = dict(vocab_size=vocab_size, math=checkpoint['config'].math,
**literal_eval(checkpoint['config'].model_config))
model_config = checkpoint['model_config']
model_config['batch_first'] = args.batch_first
model = GNMT(**model_config)
model = GNMT(vocab_size=vocab_size, **model_config)
model.load_state_dict(checkpoint['state_dict'])
state_dict = checkpoint['state_dict']
if checkpoint_from_distributed(state_dict):
state_dict = unwrap_distributed(state_dict)
for (math, batch_size, beam_size) in product(args.math, args.batch_size,
args.beam_size):
logging.info(f'math: {math}, batch size: {batch_size}, '
f'beam size: {beam_size}')
if math == 'fp32':
dtype = torch.FloatTensor
if math == 'fp16':
dtype = torch.HalfTensor
model.type(dtype)
model.load_state_dict(state_dict)
if args.cuda:
model = model.cuda()
model.eval()
if args.math == 'fp32':
dtype = torch.FloatTensor
if args.math == 'fp16':
dtype = torch.HalfTensor
# construct the dataset
test_data = TextDataset(src_fname=args.input,
tokenizer=tokenizer,
sort=args.sort)
model.type(dtype)
if args.cuda:
model = model.cuda()
model.eval()
# build the data loader
test_loader = test_data.get_loader(batch_size=batch_size,
batch_first=args.batch_first,
shuffle=False,
pad=True,
num_workers=0)
# construct the dataset
test_data = TextDataset(src_fname=args.input,
tokenizer=tokenizer,
sort=False)
# build the translator object
translator = Translator(model=model,
tokenizer=tokenizer,
loader=test_loader,
beam_size=beam_size,
max_seq_len=args.max_seq_len,
len_norm_factor=args.len_norm_factor,
len_norm_const=args.len_norm_const,
cov_penalty_factor=args.cov_penalty_factor,
cuda=args.cuda,
print_freq=args.print_freq,
dataset_dir=args.dataset_dir)
# build the data loader
test_loader = test_data.get_loader(batch_size=args.batch_size,
batch_first=args.batch_first,
shuffle=False,
num_workers=0,
drop_last=False)
# execute the inference
translator.run(calc_bleu=args.bleu, eval_path=args.output,
reference_path=args.reference, summary=True)
# build the translator object
translator = Translator(model=model,
tokenizer=tokenizer,
loader=test_loader,
beam_size=args.beam_size,
max_seq_len=args.max_seq_len,
len_norm_factor=args.len_norm_factor,
len_norm_const=args.len_norm_const,
cov_penalty_factor=args.cov_penalty_factor,
cuda=args.cuda,
print_freq=args.print_freq,
dataset_dir=args.dataset_dir)
# execute the inference
translator.run(calc_bleu=args.bleu, eval_path=args.output,
reference_path=args.reference, summary=True)
if __name__ == '__main__':
main()