Merge pull request #857 from NVIDIA/gh/release

Gh/release
This commit is contained in:
nv-kkudrynski 2021-03-04 16:17:59 +01:00 committed by GitHub
commit 7fe99babad
No known key found for this signature in database
GPG key ID: 4AEE18F83AFDEB23
113 changed files with 23742 additions and 712 deletions

View file

@ -1,4 +1,4 @@
ARG FROM_IMAGE_NAME=nvcr.io/nvidia/pytorch:20.07-py3
ARG FROM_IMAGE_NAME=nvcr.io/nvidia/pytorch:20.12-py3
FROM ${FROM_IMAGE_NAME}
ADD requirements.txt /workspace/

View file

@ -1,4 +1,4 @@
# Convolutional Networks for Image Classification in PyTorch
# Convolutional Network for Image Classification in PyTorch
In this repository you will find implementations of various image classification models.
@ -9,7 +9,7 @@ Detailed information on each model can be found here:
* [Models](#models)
* [Validation accuracy results](#validation-accuracy-results)
* [Training performance results](#training-performance-results)
* [Training performance: NVIDIA DGX A100 (8x A100 40GB)](#training-performance-nvidia-dgx-a100-8x-a100-40gb)
* [Training performance: NVIDIA DGX A100 (8x A100 80GB)](#training-performance-nvidia-dgx-a100-8x-a100-80gb)
* [Training performance: NVIDIA DGX-1 16GB (8x V100 16GB)](#training-performance-nvidia-dgx-1-16gb-8x-v100-16gb)
* [Training performance: NVIDIA DGX-2 (16x V100 32GB)](#training-performance-nvidia-dgx-2-16x-v100-32gb)
* [Model comparison](#model-comparison)
@ -38,22 +38,20 @@ in the corresponding model's README.
The following table shows the validation accuracy results of the
three classification models side-by-side.
| **arch** | **AMP Top1** | **AMP Top5** | **FP32 Top1** | **FP32 Top5** |
|:-:|:-:|:-:|:-:|:-:|
| resnet50 | 78.46 | 94.15 | 78.50 | 94.11 |
| resnext101-32x4d | 80.08 | 94.89 | 80.14 | 95.02 |
| se-resnext101-32x4d | 81.01 | 95.52 | 81.12 | 95.54 |
| **Model** | **Mixed Precision Top1** | **Mixed Precision Top5** | **32 bit Top1** | **32 bit Top5** |
|:-------------------:|:------------------------:|:------------------------:|:---------------:|:---------------:|
| resnet50 | 78.60 | 94.19 | 78.69 | 94.16 |
| resnext101-32x4d | 80.43 | 95.06 | 80.40 | 95.04 |
| se-resnext101-32x4d | 81.00 | 95.48 | 81.09 | 95.45 |
## Training performance results
### Training performance: NVIDIA DGX A100 (8x A100 40GB)
### Training performance: NVIDIA DGX A100 (8x A100 80GB)
Our results were obtained by running the applicable
training scripts in the pytorch-20.06 NGC container
on NVIDIA DGX A100 with (8x A100 40GB) GPUs.
training scripts in the pytorch-20.12 NGC container
on NVIDIA DGX A100 with (8x A100 80GB) GPUs.
Performance numbers (in images per second)
were averaged over an entire training epoch.
The specific training script that was run is documented
@ -63,21 +61,16 @@ The following table shows the training accuracy results of the
three classification models side-by-side.
| **arch** | **Mixed Precision** | **TF32** | **Mixed Precision Speedup** |
|:-------------------:|:-------------------:|:-------------:|:---------------------------:|
| resnet50 | 9488.39 img/s | 5322.10 img/s | 1.78x |
| resnext101-32x4d | 6758.98 img/s | 2353.25 img/s | 2.87x |
| se-resnext101-32x4d | 4670.72 img/s | 2011.21 img/s | 2.32x |
ResNeXt and SE-ResNeXt use [NHWC data layout](https://pytorch.org/tutorials/intermediate/memory_format_tutorial.html) when training using Mixed Precision,
which improves the model performance. We are currently working on adding it for ResNet.
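As an aside, opting a model into the NHWC (`channels_last`) memory format in PyTorch looks roughly like the sketch below; it assumes a CUDA device, and torchvision's `resnet50` merely stands in for the models in this repository.

```python
import torch
import torchvision.models as models

# Convert both the model and the input batch to channels_last (NHWC) layout;
# with AMP enabled, convolutions can then use the faster NHWC Tensor Core kernels.
model = models.resnet50().cuda().to(memory_format=torch.channels_last)
data = torch.rand(64, 3, 224, 224, device="cuda").to(memory_format=torch.channels_last)

with torch.cuda.amp.autocast():
    output = model(data)
    loss = output.float().mean()
loss.backward()
```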
| **Model** | **Mixed Precision** | **TF32** | **Mixed Precision Speedup** |
|:-------------------:|:-------------------:|:----------:|:---------------------------:|
| resnet50 | 15977 img/s | 7365 img/s | 2.16 x |
| resnext101-32x4d | 7399 img/s | 3193 img/s | 2.31 x |
| se-resnext101-32x4d | 5248 img/s | 2665 img/s | 1.96 x |
### Training performance: NVIDIA DGX-1 16GB (8x V100 16GB)
Our results were obtained by running the applicable
training scripts in the pytorch-20.06 NGC container
training scripts in the pytorch-20.12 NGC container
on NVIDIA DGX-1 with (8x V100 16GB) GPUs.
Performance numbers (in images per second)
were averaged over an entire training epoch.
@ -87,16 +80,11 @@ in the corresponding model's README.
The following table shows the training performance results of the
three classification models side-by-side.
| **arch** | **Mixed Precision** | **FP32** | **Mixed Precision Speedup** |
|:-------------------:|:-------------------:|:-------------:|:---------------------------:|
| resnet50 | 6565.61 img/s | 2869.19 img/s | 2.29x |
| resnext101-32x4d | 3922.74 img/s | 1136.30 img/s | 3.45x |
| se-resnext101-32x4d | 2651.13 img/s | 982.78 img/s | 2.70x |
ResNeXt and SE-ResNeXt use [NHWC data layout](https://pytorch.org/tutorials/intermediate/memory_format_tutorial.html) when training using Mixed Precision,
which improves the model performance. We are currently working on adding it for ResNet.
| **Model** | **Mixed Precision** | **FP32** | **Mixed Precision Speedup** |
|:-------------------:|:-------------------:|:----------:|:---------------------------:|
| resnet50 | 7608 img/s | 2851 img/s | 2.66 x |
| resnext101-32x4d | 3742 img/s | 1117 img/s | 3.34 x |
| se-resnext101-32x4d | 2716 img/s | 994 img/s | 2.73 x |
## Model Comparison
@ -111,8 +99,6 @@ Dot size indicates number of trainable parameters.
### Latency vs Throughput on different batch sizes
![LATvsTHR](./img/LATvsTHR.png)
Plot describes relationship between
inference latency, throughput and batch size
The plot describes the relationship between
inference latency, throughput and batch size
for the implemented models.

View file

@ -30,7 +30,7 @@ if __name__ == "__main__":
add_parser_arguments(parser)
args = parser.parse_args()
checkpoint = torch.load(args.checkpoint_path)
checkpoint = torch.load(args.checkpoint_path, map_location=torch.device('cpu'))
model_state_dict = {
k[len("module.") :] if "module." in k else k: v
@ -39,4 +39,4 @@ if __name__ == "__main__":
print(f"Loaded {checkpoint['arch']} : {checkpoint['best_prec1']}")
torch.save(model_state_dict, args.weight_path)
torch.save(model_state_dict, args.weight_path.format(arch=checkpoint['arch'][0], acc = checkpoint['best_prec1']))
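For reference, the conversion above boils down to the following minimal sketch: load a (possibly DistributedDataParallel) checkpoint onto the CPU, strip the `module.` prefix, and save only the model weights. The file names and the handling of `checkpoint['arch']` here are illustrative.

```python
import torch

# Load onto CPU so the checkpoint can be converted on a machine without a GPU.
checkpoint = torch.load("checkpoint.pth.tar", map_location=torch.device("cpu"))

# Strip the "module." prefix added by DistributedDataParallel.
model_state_dict = {
    (k[len("module."):] if k.startswith("module.") else k): v
    for k, v in checkpoint["state_dict"].items()
}

torch.save(model_state_dict, "resnet50_weights.pth")
```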

View file

@ -16,19 +16,12 @@ import argparse
import numpy as np
import json
import torch
from torch.cuda.amp import autocast
import torch.backends.cudnn as cudnn
import torchvision.transforms as transforms
import image_classification.resnet as models
from image_classification.dataloaders import load_jpeg_from_file
try:
from apex.fp16_utils import *
from apex import amp
except ImportError:
raise ImportError(
"Please install apex from https://www.github.com/nvidia/apex to run this example."
)
def add_parser_arguments(parser):
model_names = models.resnet_versions.keys()
@ -52,7 +45,7 @@ def add_parser_arguments(parser):
)
parser.add_argument("--weights", metavar="<path>", help="file with model weights")
parser.add_argument(
"--precision", metavar="PREC", default="FP16", choices=["AMP", "FP16", "FP32"]
"--precision", metavar="PREC", default="AMP", choices=["AMP", "FP32"]
)
parser.add_argument("--image", metavar="<path>", help="path to classified image")
@ -63,30 +56,28 @@ def main(args):
if args.weights is not None:
weights = torch.load(args.weights)
#Temporary fix to allow NGC checkpoint loading
# Temporary fix to allow NGC checkpoint loading
weights = {
k.replace("module.", ""): v for k, v in weights.items()
}
model.load_state_dict(weights)
model = model.cuda()
if args.precision in ["AMP", "FP16"]:
model = network_to_half()
model.eval()
with torch.no_grad():
input = load_jpeg_from_file(
args.image, cuda=True, fp16=args.precision != "FP32"
)
input = load_jpeg_from_file(
args.image, cuda=True
)
output = torch.nn.functional.softmax(model(input), dim=1).cpu().view(-1).numpy()
top5 = np.argsort(output)[-5:][::-1]
with torch.no_grad(), autocast(enabled = args.precision == "AMP"):
output = torch.nn.functional.softmax(model(input), dim=1)
print(args.image)
for c, v in zip(imgnet_classes[top5], output[top5]):
print(f"{c}: {100*v:.1f}%")
output = output.float().cpu().view(-1).numpy()
top5 = np.argsort(output)[-5:][::-1]
print(args.image)
for c, v in zip(imgnet_classes[top5], output[top5]):
print(f"{c}: {100*v:.1f}%")
if __name__ == "__main__":
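The new inference path above (native AMP `autocast` instead of APEX half-precision casts) condenses to the following self-contained sketch; torchvision's `resnet50` and a random input stand in for this repository's model factory and `load_jpeg_from_file`.

```python
import numpy as np
import torch
from torch.cuda.amp import autocast
import torchvision.models as models

model = models.resnet50().cuda().eval()
input = torch.rand(1, 3, 224, 224, device="cuda")  # placeholder for a preprocessed JPEG

with torch.no_grad(), autocast(enabled=True):
    output = torch.nn.functional.softmax(model(input), dim=1)

output = output.float().cpu().view(-1).numpy()
top5 = np.argsort(output)[-5:][::-1]
print(top5, output[top5])
```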

View file

@ -0,0 +1,183 @@
precision:
AMP:
static_loss_scale: 128
amp: True
FP32:
amp: False
TF32:
amp: False
platform:
T4:
workers: 8
DGX1V:
workers: 8
DGX2V:
workers: 8
DGXA100:
workers: 16
mode:
benchmark_training: &benchmark_training
print_freq: 1
epochs: 3
training_only: True
evaluate: False
save_checkpoints: False
benchmark_training_short:
<<: *benchmark_training
epochs: 1
data_backend: syntetic
prof: 100
benchmark_inference: &benchmark_inference
print_freq: 1
epochs: 1
training_only: False
evaluate: True
save_checkpoints: False
convergence:
print_freq: 20
training_only: False
evaluate: False
save_checkpoints: True
anchors:
# ResNet_like params: {{{
resnet_params: &resnet_params
model_config: fanin
label_smoothing: 0.1
mixup: 0.2
lr_schedule: cosine
momentum: 0.875
warmup: 8
epochs: 250
data_backend: pytorch
num_classes: 1000
image_size: 224
resnet_params_896: &resnet_params_896
<<: *resnet_params
optimizer_batch_size: 896
lr: 0.896
weight_decay: 6.103515625e-05
resnet_params_1k: &resnet_params_1k
<<: *resnet_params
optimizer_batch_size: 1024
lr: 1.024
weight_decay: 6.103515625e-05
resnet_params_2k: &resnet_params_2k
<<: *resnet_params
optimizer_batch_size: 2048
lr: 2.048
weight_decay: 3.0517578125e-05
resnet_params_4k: &resnet_params_4k
<<: *resnet_params
optimizer_batch_size: 4096
lr: 4.086
weight_decay: 3.0517578125e-05
# }}}
models:
resnet50: # {{{
DGX1V:
AMP:
<<: *resnet_params_2k
arch: resnet50
batch_size: 256
memory_format: nhwc
FP32:
<<: *resnet_params_2k
batch_size: 112
DGX2V:
AMP:
<<: *resnet_params_4k
arch: resnet50
batch_size: 256
memory_format: nhwc
FP32:
<<: *resnet_params_4k
arch: resnet50
batch_size: 256
DGXA100:
AMP:
<<: *resnet_params_2k
arch: resnet50
batch_size: 256
memory_format: nhwc
TF32:
<<: *resnet_params_2k
arch: resnet50
batch_size: 256
T4:
AMP:
<<: *resnet_params_2k
arch: resnet50
batch_size: 256
memory_format: nhwc
FP32:
<<: *resnet_params_2k
batch_size: 128
# }}}
resnext101-32x4d: # {{{
DGX1V:
AMP:
<<: *resnet_params_1k
arch: resnext101-32x4d
batch_size: 128
memory_format: nhwc
FP32:
<<: *resnet_params_1k
arch: resnext101-32x4d
batch_size: 64
DGXA100:
AMP:
<<: *resnet_params_1k
arch: resnext101-32x4d
batch_size: 128
memory_format: nhwc
TF32:
<<: *resnet_params_1k
arch: resnext101-32x4d
batch_size: 128
T4:
AMP:
<<: *resnet_params_1k
arch: resnext101-32x4d
batch_size: 128
memory_format: nhwc
FP32:
<<: *resnet_params_1k
arch: resnext101-32x4d
batch_size: 64
# }}}
se-resnext101-32x4d: # {{{
DGX1V:
AMP:
<<: *resnet_params_896
arch: se-resnext101-32x4d
batch_size: 112
memory_format: nhwc
FP32:
<<: *resnet_params_1k
arch: se-resnext101-32x4d
batch_size: 64
DGXA100:
AMP:
<<: *resnet_params_1k
arch: se-resnext101-32x4d
batch_size: 128
memory_format: nhwc
TF32:
<<: *resnet_params_1k
arch: se-resnext101-32x4d
batch_size: 128
T4:
AMP:
<<: *resnet_params_1k
arch: se-resnext101-32x4d
batch_size: 128
memory_format: nhwc
FP32:
<<: *resnet_params_1k
arch: se-resnext101-32x4d
batch_size: 64
# }}}
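A quick way to see how the YAML anchors and merge keys above expand is to load the file and print one of the merged entries; this mirrors what `launch.py` below does, and assumes the script runs next to `configs.yml`.

```python
import yaml

with open("configs.yml") as cfg_file:
    config = yaml.load(cfg_file, Loader=yaml.FullLoader)

# resnet50 / DGX1V / AMP already contains every key pulled in via <<: *resnet_params_2k.
print(config["models"]["resnet50"]["DGX1V"]["AMP"])
```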

View file

@ -50,7 +50,7 @@ except ImportError:
)
def load_jpeg_from_file(path, cuda=True, fp16=False):
def load_jpeg_from_file(path, cuda=True):
img_transforms = transforms.Compose(
[transforms.Resize(256), transforms.CenterCrop(224), transforms.ToTensor()]
)
@ -67,12 +67,7 @@ def load_jpeg_from_file(path, cuda=True, fp16=False):
mean = mean.cuda()
std = std.cuda()
img = img.cuda()
if fp16:
mean = mean.half()
std = std.half()
img = img.half()
else:
img = img.float()
img = img.float()
input = img.unsqueeze(0).sub_(mean).div_(std)
@ -98,6 +93,7 @@ class HybridTrainPipe(Pipeline):
shard_id=rank,
num_shards=world_size,
random_shuffle=True,
pad_last_batch=True,
)
if dali_cpu:
@ -125,10 +121,9 @@ class HybridTrainPipe(Pipeline):
self.cmnp = ops.CropMirrorNormalize(
device="gpu",
output_dtype=types.FLOAT,
dtype=types.FLOAT,
output_layout=types.NCHW,
crop=(crop, crop),
image_type=types.RGB,
mean=[0.485 * 255, 0.456 * 255, 0.406 * 255],
std=[0.229 * 255, 0.224 * 255, 0.225 * 255],
)
@ -160,16 +155,16 @@ class HybridValPipe(Pipeline):
shard_id=rank,
num_shards=world_size,
random_shuffle=False,
pad_last_batch=True,
)
self.decode = ops.ImageDecoder(device="mixed", output_type=types.RGB)
self.res = ops.Resize(device="gpu", resize_shorter=size)
self.cmnp = ops.CropMirrorNormalize(
device="gpu",
output_dtype=types.FLOAT,
dtype=types.FLOAT,
output_layout=types.NCHW,
crop=(crop, crop),
image_type=types.RGB,
mean=[0.485 * 255, 0.456 * 255, 0.406 * 255],
std=[0.229 * 255, 0.224 * 255, 0.225 * 255],
)
@ -213,7 +208,6 @@ def get_dali_train_loader(dali_cpu=False):
start_epoch=0,
workers=5,
_worker_init_fn=None,
fp16=False,
memory_format=torch.contiguous_format,
):
if torch.distributed.is_initialized():
@ -236,7 +230,7 @@ def get_dali_train_loader(dali_cpu=False):
pipe.build()
train_loader = DALIClassificationIterator(
pipe, size=int(pipe.epoch_size("Reader") / world_size)
pipe, reader_name="Reader", fill_last_batch=False
)
return (
@ -255,7 +249,6 @@ def get_dali_val_loader():
one_hot,
workers=5,
_worker_init_fn=None,
fp16=False,
memory_format=torch.contiguous_format,
):
if torch.distributed.is_initialized():
@ -278,7 +271,7 @@ def get_dali_val_loader():
pipe.build()
val_loader = DALIClassificationIterator(
pipe, size=int(pipe.epoch_size("Reader") / world_size)
pipe, reader_name="Reader", fill_last_batch=False
)
return (
@ -317,7 +310,7 @@ def expand(num_classes, dtype, tensor):
class PrefetchedWrapper(object):
def prefetched_loader(loader, num_classes, fp16, one_hot):
def prefetched_loader(loader, num_classes, one_hot):
mean = (
torch.tensor([0.485 * 255, 0.456 * 255, 0.406 * 255])
.cuda()
@ -328,9 +321,6 @@ class PrefetchedWrapper(object):
.cuda()
.view(1, 3, 1, 1)
)
if fp16:
mean = mean.half()
std = std.half()
stream = torch.cuda.Stream()
first = True
@ -339,14 +329,9 @@ class PrefetchedWrapper(object):
with torch.cuda.stream(stream):
next_input = next_input.cuda(non_blocking=True)
next_target = next_target.cuda(non_blocking=True)
if fp16:
next_input = next_input.half()
if one_hot:
next_target = expand(num_classes, torch.half, next_target)
else:
next_input = next_input.float()
if one_hot:
next_target = expand(num_classes, torch.float, next_target)
next_input = next_input.float()
if one_hot:
next_target = expand(num_classes, torch.float, next_target)
next_input = next_input.sub_(mean).div_(std)
@ -361,9 +346,8 @@ class PrefetchedWrapper(object):
yield input, target
def __init__(self, dataloader, start_epoch, num_classes, fp16, one_hot):
def __init__(self, dataloader, start_epoch, num_classes, one_hot):
self.dataloader = dataloader
self.fp16 = fp16
self.epoch = start_epoch
self.one_hot = one_hot
self.num_classes = num_classes
@ -376,7 +360,7 @@ class PrefetchedWrapper(object):
self.dataloader.sampler.set_epoch(self.epoch)
self.epoch += 1
return PrefetchedWrapper.prefetched_loader(
self.dataloader, self.num_classes, self.fp16, self.one_hot
self.dataloader, self.num_classes, self.one_hot
)
def __len__(self):
@ -391,7 +375,6 @@ def get_pytorch_train_loader(
start_epoch=0,
workers=5,
_worker_init_fn=None,
fp16=False,
memory_format=torch.contiguous_format,
):
traindir = os.path.join(data_path, "train")
@ -403,24 +386,24 @@ def get_pytorch_train_loader(
)
if torch.distributed.is_initialized():
train_sampler = torch.utils.data.distributed.DistributedSampler(train_dataset)
train_sampler = torch.utils.data.distributed.DistributedSampler(train_dataset, shuffle=True)
else:
train_sampler = None
train_loader = torch.utils.data.DataLoader(
train_dataset,
sampler=train_sampler,
batch_size=batch_size,
shuffle=(train_sampler is None),
num_workers=workers,
worker_init_fn=_worker_init_fn,
pin_memory=True,
sampler=train_sampler,
collate_fn=partial(fast_collate, memory_format),
drop_last=True,
)
return (
PrefetchedWrapper(train_loader, start_epoch, num_classes, fp16, one_hot),
PrefetchedWrapper(train_loader, start_epoch, num_classes, one_hot),
len(train_loader),
)
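The sampler selection above follows a standard pattern: use a `DistributedSampler` when `torch.distributed` is initialized, otherwise let the `DataLoader` shuffle. A minimal, self-contained sketch (with a dummy dataset standing in for ImageNet):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

# Dummy dataset standing in for the ImageFolder used above.
train_dataset = TensorDataset(torch.rand(128, 3, 224, 224),
                              torch.randint(0, 1000, (128,)))

if torch.distributed.is_initialized():
    train_sampler = DistributedSampler(train_dataset, shuffle=True)
else:
    train_sampler = None

train_loader = DataLoader(
    train_dataset,
    sampler=train_sampler,
    batch_size=32,
    shuffle=(train_sampler is None),  # DataLoader shuffles only when no sampler is set
    num_workers=2,
    pin_memory=True,
    drop_last=True,
)
```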
@ -432,7 +415,6 @@ def get_pytorch_val_loader(
one_hot,
workers=5,
_worker_init_fn=None,
fp16=False,
memory_format=torch.contiguous_format,
):
valdir = os.path.join(data_path, "val")
@ -441,7 +423,7 @@ def get_pytorch_val_loader(
)
if torch.distributed.is_initialized():
val_sampler = torch.utils.data.distributed.DistributedSampler(val_dataset)
val_sampler = torch.utils.data.distributed.DistributedSampler(val_dataset, shuffle=False)
else:
val_sampler = None
@ -449,20 +431,20 @@ def get_pytorch_val_loader(
val_dataset,
sampler=val_sampler,
batch_size=batch_size,
shuffle=False,
shuffle=(val_sampler is None),
num_workers=workers,
worker_init_fn=_worker_init_fn,
pin_memory=True,
collate_fn=partial(fast_collate, memory_format),
drop_last=False,
)
return PrefetchedWrapper(val_loader, 0, num_classes, fp16, one_hot), len(val_loader)
return PrefetchedWrapper(val_loader, 0, num_classes, one_hot), len(val_loader)
class SynteticDataLoader(object):
def __init__(
self,
fp16,
batch_size,
num_classes,
num_channels,
@ -483,8 +465,6 @@ class SynteticDataLoader(object):
else:
input_target = torch.randint(0, num_classes, (batch_size,))
input_target = input_target.cuda()
if fp16:
input_data = input_data.half()
self.input_data = input_data
self.input_target = input_target
@ -502,19 +482,11 @@ def get_syntetic_loader(
start_epoch=0,
workers=None,
_worker_init_fn=None,
fp16=False,
memory_format=torch.contiguous_format,
):
return (
SynteticDataLoader(
fp16,
batch_size,
num_classes,
3,
224,
224,
one_hot,
memory_format=memory_format,
batch_size, num_classes, 3, 224, 224, one_hot, memory_format=memory_format
),
-1,
)
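For context, the stream-based prefetching used by `PrefetchedWrapper` condenses to the illustrative sketch below; it assumes a CUDA device and a loader that yields uint8 NCHW image batches with integer targets.

```python
import torch

def prefetched_loader(loader):
    mean = torch.tensor([0.485 * 255, 0.456 * 255, 0.406 * 255]).cuda().view(1, 3, 1, 1)
    std = torch.tensor([0.229 * 255, 0.224 * 255, 0.225 * 255]).cuda().view(1, 3, 1, 1)

    stream = torch.cuda.Stream()
    first = True
    input, target = None, None
    for next_input, next_target in loader:
        with torch.cuda.stream(stream):  # copy + normalize the next batch on a side stream
            next_input = next_input.cuda(non_blocking=True).float()
            next_target = next_target.cuda(non_blocking=True)
            next_input = next_input.sub_(mean).div_(std)
        if not first:
            yield input, target
        else:
            first = False
        torch.cuda.current_stream().wait_stream(stream)  # sync before the batch is used
        input, target = next_input, next_target
    if input is not None:
        yield input, target
```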

View file

@ -38,14 +38,8 @@ from . import resnet as models
from . import utils
import dllogger
try:
from apex.parallel import DistributedDataParallel as DDP
from apex.fp16_utils import *
from apex import amp
except ImportError:
raise ImportError(
"Please install apex from https://www.github.com/nvidia/apex to run this example."
)
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.cuda.amp import autocast
ACC_METADATA = {"unit": "%", "format": ":.2f"}
IPS_METADATA = {"unit": "img/s", "format": ":.2f"}
@ -60,7 +54,6 @@ class ModelAndLoss(nn.Module):
loss,
pretrained_weights=None,
cuda=True,
fp16=False,
memory_format=torch.contiguous_format,
):
super(ModelAndLoss, self).__init__()
@ -74,8 +67,6 @@ class ModelAndLoss(nn.Module):
if cuda:
model = model.cuda().to(memory_format=memory_format)
if fp16:
model = network_to_half(model)
# define loss function (criterion) and optimizer
criterion = loss()
@ -92,8 +83,8 @@ class ModelAndLoss(nn.Module):
return loss, output
def distributed(self):
self.model = DDP(self.model)
def distributed(self, gpu_id):
self.model = DDP(self.model, device_ids=[gpu_id], output_device=gpu_id)
def load_model_state(self, state):
if not state is None:
@ -102,14 +93,11 @@ class ModelAndLoss(nn.Module):
def get_optimizer(
parameters,
fp16,
lr,
momentum,
weight_decay,
nesterov=False,
state=None,
static_loss_scale=1.0,
dynamic_loss_scale=False,
bn_weight_decay=False,
):
@ -138,13 +126,6 @@ def get_optimizer(
weight_decay=weight_decay,
nesterov=nesterov,
)
if fp16:
optimizer = FP16_Optimizer(
optimizer,
static_loss_scale=static_loss_scale,
dynamic_loss_scale=dynamic_loss_scale,
verbose=False,
)
if not state is None:
optimizer.load_state_dict(state)
@ -227,36 +208,25 @@ def lr_exponential_policy(
def get_train_step(
model_and_loss, optimizer, fp16, use_amp=False, batch_size_multiplier=1
model_and_loss, optimizer, scaler, use_amp=False, batch_size_multiplier=1
):
def _step(input, target, optimizer_step=True):
input_var = Variable(input)
target_var = Variable(target)
loss, output = model_and_loss(input_var, target_var)
if torch.distributed.is_initialized():
reduced_loss = utils.reduce_tensor(loss.data)
else:
reduced_loss = loss.data
if fp16:
optimizer.backward(loss)
elif use_amp:
with amp.scale_loss(loss, optimizer) as scaled_loss:
scaled_loss.backward()
else:
loss.backward()
with autocast(enabled=use_amp):
loss, output = model_and_loss(input_var, target_var)
loss /= batch_size_multiplier
if torch.distributed.is_initialized():
reduced_loss = utils.reduce_tensor(loss.data)
else:
reduced_loss = loss.data
scaler.scale(loss).backward()
if optimizer_step:
opt = (
optimizer.optimizer
if isinstance(optimizer, FP16_Optimizer)
else optimizer
)
for param_group in opt.param_groups:
for param in param_group["params"]:
param.grad /= batch_size_multiplier
optimizer.step()
scaler.step(optimizer)
scaler.update()
optimizer.zero_grad()
torch.cuda.synchronize()
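The native-AMP training step above reduces to the following sketch; `model`, `criterion`, `optimizer` and `scaler` are supplied by the caller, and the division by `batch_size_multiplier` implements gradient accumulation.

```python
from torch.cuda.amp import autocast

def train_step(model, criterion, optimizer, scaler, input, target,
               use_amp=True, batch_size_multiplier=1, optimizer_step=True):
    with autocast(enabled=use_amp):
        loss = criterion(model(input), target)
        loss = loss / batch_size_multiplier   # gradient-accumulation scaling
    scaler.scale(loss).backward()             # backward pass on the scaled loss
    if optimizer_step:
        scaler.step(optimizer)                # unscales grads, skips the step on inf/NaN
        scaler.update()
        optimizer.zero_grad()
    return loss.detach()
```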
@ -270,10 +240,11 @@ def train(
train_loader,
model_and_loss,
optimizer,
scaler,
lr_scheduler,
fp16,
logger,
epoch,
timeout_handler,
use_amp=False,
prof=-1,
batch_size_multiplier=1,
@ -315,7 +286,7 @@ def train(
step = get_train_step(
model_and_loss,
optimizer,
fp16,
scaler=scaler,
use_amp=use_amp,
batch_size_multiplier=batch_size_multiplier,
)
@ -342,31 +313,33 @@ def train(
it_time = time.time() - end
if logger is not None:
logger.log_metric("train.loss", to_python_float(loss), bs)
logger.log_metric("train.loss", loss.item(), bs)
logger.log_metric("train.compute_ips", calc_ips(bs, it_time - data_time))
logger.log_metric("train.total_ips", calc_ips(bs, it_time))
logger.log_metric("train.data_time", data_time)
logger.log_metric("train.compute_time", it_time - data_time)
end = time.time()
if timeout_handler.interrupted:
break
def get_val_step(model_and_loss):
def get_val_step(model_and_loss, use_amp=False):
def _step(input, target):
input_var = Variable(input)
target_var = Variable(target)
with torch.no_grad():
with torch.no_grad(), autocast(enabled=use_amp):
loss, output = model_and_loss(input_var, target_var)
prec1, prec5 = utils.accuracy(output.data, target, topk=(1, 5))
prec1, prec5 = utils.accuracy(output.data, target, topk=(1, 5))
if torch.distributed.is_initialized():
reduced_loss = utils.reduce_tensor(loss.data)
prec1 = utils.reduce_tensor(prec1)
prec5 = utils.reduce_tensor(prec5)
else:
reduced_loss = loss.data
if torch.distributed.is_initialized():
reduced_loss = utils.reduce_tensor(loss.data)
prec1 = utils.reduce_tensor(prec1)
prec5 = utils.reduce_tensor(prec5)
else:
reduced_loss = loss.data
torch.cuda.synchronize()
@ -376,7 +349,13 @@ def get_val_step(model_and_loss):
def validate(
val_loader, model_and_loss, fp16, logger, epoch, prof=-1, register_metrics=True
val_loader,
model_and_loss,
logger,
epoch,
use_amp=False,
prof=-1,
register_metrics=True,
):
if register_metrics and logger is not None:
logger.register_metric(
@ -440,7 +419,7 @@ def validate(
metadata=TIME_METADATA,
)
step = get_val_step(model_and_loss)
step = get_val_step(model_and_loss, use_amp=use_amp)
top1 = log.AverageMeter()
# switch to evaluate mode
@ -462,11 +441,11 @@ def validate(
it_time = time.time() - end
top1.record(to_python_float(prec1), bs)
top1.record(prec1.item(), bs)
if logger is not None:
logger.log_metric("val.top1", to_python_float(prec1), bs)
logger.log_metric("val.top5", to_python_float(prec5), bs)
logger.log_metric("val.loss", to_python_float(loss), bs)
logger.log_metric("val.top1", prec1.item(), bs)
logger.log_metric("val.top5", prec5.item(), bs)
logger.log_metric("val.loss", loss.item(), bs)
logger.log_metric("val.compute_ips", calc_ips(bs, it_time - data_time))
logger.log_metric("val.total_ips", calc_ips(bs, it_time))
logger.log_metric("val.data_time", data_time)
@ -492,10 +471,10 @@ def calc_ips(batch_size, time):
def train_loop(
model_and_loss,
optimizer,
scaler,
lr_scheduler,
train_loader,
val_loader,
fp16,
logger,
should_backup_checkpoint,
use_amp=False,
@ -510,70 +489,77 @@ def train_loop(
checkpoint_dir="./",
checkpoint_filename="checkpoint.pth.tar",
):
prec1 = -1
print(f"RUNNING EPOCHS FROM {start_epoch} TO {end_epoch}")
for epoch in range(start_epoch, end_epoch):
if logger is not None:
logger.start_epoch()
if not skip_training:
train(
train_loader,
model_and_loss,
optimizer,
lr_scheduler,
fp16,
logger,
epoch,
use_amp=use_amp,
prof=prof,
register_metrics=epoch == start_epoch,
batch_size_multiplier=batch_size_multiplier,
)
if not skip_validation:
prec1, nimg = validate(
val_loader,
model_and_loss,
fp16,
logger,
epoch,
prof=prof,
register_metrics=epoch == start_epoch,
)
if logger is not None:
logger.end_epoch()
if save_checkpoints and (
not torch.distributed.is_initialized() or torch.distributed.get_rank() == 0
):
if not skip_validation:
is_best = logger.metrics["val.top1"]["meter"].get_epoch() > best_prec1
best_prec1 = max(
logger.metrics["val.top1"]["meter"].get_epoch(), best_prec1
with utils.TimeoutHandler() as timeout_handler:
for epoch in range(start_epoch, end_epoch):
if logger is not None:
logger.start_epoch()
if not skip_training:
train(
train_loader,
model_and_loss,
optimizer,
scaler,
lr_scheduler,
logger,
epoch,
timeout_handler,
use_amp=use_amp,
prof=prof,
register_metrics=epoch == start_epoch,
batch_size_multiplier=batch_size_multiplier,
)
else:
is_best = False
best_prec1 = 0
if should_backup_checkpoint(epoch):
backup_filename = "checkpoint-{}.pth.tar".format(epoch + 1)
else:
backup_filename = None
utils.save_checkpoint(
{
"epoch": epoch + 1,
"arch": model_and_loss.arch,
"state_dict": model_and_loss.model.state_dict(),
"best_prec1": best_prec1,
"optimizer": optimizer.state_dict(),
},
is_best,
checkpoint_dir=checkpoint_dir,
backup_filename=backup_filename,
filename=checkpoint_filename,
)
if not skip_validation:
prec1, nimg = validate(
val_loader,
model_and_loss,
logger,
epoch,
use_amp=use_amp,
prof=prof,
register_metrics=epoch == start_epoch,
)
if logger is not None:
logger.end_epoch()
if save_checkpoints and (
not torch.distributed.is_initialized()
or torch.distributed.get_rank() == 0
):
if not skip_validation:
is_best = (
logger.metrics["val.top1"]["meter"].get_epoch() > best_prec1
)
best_prec1 = max(
logger.metrics["val.top1"]["meter"].get_epoch(), best_prec1
)
else:
is_best = False
best_prec1 = 0
if should_backup_checkpoint(epoch):
backup_filename = "checkpoint-{}.pth.tar".format(epoch + 1)
else:
backup_filename = None
utils.save_checkpoint(
{
"epoch": epoch + 1,
"arch": model_and_loss.arch,
"state_dict": model_and_loss.model.state_dict(),
"best_prec1": best_prec1,
"optimizer": optimizer.state_dict(),
},
is_best,
checkpoint_dir=checkpoint_dir,
backup_filename=backup_filename,
filename=checkpoint_filename,
)
if timeout_handler.interrupted:
break
# }}}

View file

@ -31,6 +31,7 @@ import os
import numpy as np
import torch
import shutil
import signal
import torch.distributed as dist
@ -106,3 +107,45 @@ def reduce_tensor(tensor):
def first_n(n, generator):
for i, d in zip(range(n), generator):
yield d
class TimeoutHandler:
def __init__(self, sig=signal.SIGTERM):
self.sig = sig
rank = dist.get_rank() if dist.is_initialized() else 0
self.device = f'cuda:{rank}'
@property
def interrupted(self):
if not dist.is_initialized():
return self._interrupted
interrupted = torch.tensor(self._interrupted).int().to(self.device)
dist.broadcast(interrupted, 0)
interrupted = bool(interrupted.item())
return interrupted
def __enter__(self):
self._interrupted = False
self.released = False
self.original_handler = signal.getsignal(self.sig)
def master_handler(signum, frame):
self.release()
self._interrupted = True
print(f'Received SIGTERM')
def ignorind_handler(signum, frame):
self.release()
print('Received SIGTERM, ignoring')
rank = dist.get_rank() if dist.is_initialized() else 0
if rank == 0:
signal.signal(self.sig, master_handler)
else:
signal.signal(self.sig, ignorind_handler)
return self
def __exit__(self, type, value, tb):
self.release()
def release(self):
if self.released:
return False
signal.signal(self.sig, self.original_handler)
self.released = True
return True
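An illustrative usage sketch for the handler above (single-process case): a SIGTERM delivered to the process flips `interrupted`, so the training loop can exit cleanly after the current step. The import path assumes this repository's `image_classification` package.

```python
import os
import signal

from image_classification.utils import TimeoutHandler  # assumed module path

with TimeoutHandler(sig=signal.SIGTERM) as timeout_handler:
    for step in range(10):
        # ... one training iteration would go here ...
        if step == 3:
            os.kill(os.getpid(), signal.SIGTERM)  # simulate a preemption signal
        if timeout_handler.interrupted:
            print(f"stopping cleanly at step {step}")
            break
```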

View file

@ -0,0 +1,50 @@
import os
from pathlib import Path
from dataclasses import dataclass
from typing import Dict, Any
import yaml
from main import main, add_parser_arguments
import torch.backends.cudnn as cudnn
import argparse
def get_config_path():
return Path(os.path.dirname(os.path.abspath(__file__))) / "configs.yml"
if __name__ == "__main__":
yaml_cfg_parser = argparse.ArgumentParser(add_help=False)
yaml_cfg_parser.add_argument(
"--cfg_file",
default=get_config_path(),
type=str,
help="path to yaml config file",
)
yaml_cfg_parser.add_argument("--model", default=None, type=str, required=True)
yaml_cfg_parser.add_argument("--mode", default=None, type=str, required=True)
yaml_cfg_parser.add_argument("--precision", default=None, type=str, required=True)
yaml_cfg_parser.add_argument("--platform", default=None, type=str, required=True)
yaml_args, rest = yaml_cfg_parser.parse_known_args()
with open(yaml_args.cfg_file, "r") as cfg_file:
config = yaml.load(cfg_file, Loader=yaml.FullLoader)
cfg = {
**config["precision"][yaml_args.precision],
**config["platform"][yaml_args.platform],
**config["models"][yaml_args.model][yaml_args.platform][yaml_args.precision],
**config["mode"][yaml_args.mode],
}
print(cfg)
parser = argparse.ArgumentParser(description="PyTorch ImageNet Training")
add_parser_arguments(parser)
parser.set_defaults(**cfg)
args = parser.parse_args(rest)
print(args)
cudnn.benchmark = True
main(args)
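For reference, a typical invocation (matching the benchmark commands in the README below) is `python ./launch.py --model resnet50 --precision AMP --mode benchmark_training --platform DGXA100 <path to imagenet> --raport-file benchmark.json --epochs 1 --prof 100`; because `parse_known_args` is used, any additional `main.py` argument given on the command line overrides the defaults taken from `configs.yml`.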

View file

@ -32,6 +32,7 @@ import os
import shutil
import time
import random
import signal
import numpy as np
import torch
@ -45,15 +46,7 @@ import torch.utils.data
import torch.utils.data.distributed
import torchvision.transforms as transforms
import torchvision.datasets as datasets
try:
from apex.parallel import DistributedDataParallel as DDP
from apex.fp16_utils import *
from apex import amp
except ImportError:
raise ImportError(
"Please install apex from https://www.github.com/nvidia/apex to run this example."
)
from torch.nn.parallel import DistributedDataParallel as DDP
import image_classification.resnet as models
import image_classification.logger as log
@ -224,12 +217,11 @@ def add_parser_arguments(parser):
help="load weights from here",
)
parser.add_argument("--fp16", action="store_true", help="Run model fp16 mode.")
parser.add_argument(
"--static-loss-scale",
type=float,
default=1,
help="Static loss scale, positive power of 2 values can improve fp16 convergence.",
help="Static loss scale, positive power of 2 values can improve amp convergence.",
)
parser.add_argument(
"--dynamic-loss-scale",
@ -312,10 +304,6 @@ def main(args):
dist.init_process_group(backend="nccl", init_method="env://")
args.world_size = torch.distributed.get_world_size()
if args.amp and args.fp16:
print("Please use only one of the --fp16/--amp flags")
exit(1)
if args.seed is not None:
print("Using seed = {}".format(args.seed))
torch.manual_seed(args.seed + args.local_rank)
@ -324,22 +312,25 @@ def main(args):
random.seed(args.seed + args.local_rank)
def _worker_init_fn(id):
def handler(signum, frame):
print(f"Worker {id} received signal {signum}")
signal.signal(signal.SIGTERM, handler)
np.random.seed(seed=args.seed + args.local_rank + id)
random.seed(args.seed + args.local_rank + id)
else:
def _worker_init_fn(id):
pass
def handler(signum, frame):
print(f"Worker {id} received signal {signum}")
if args.fp16:
assert (
torch.backends.cudnn.enabled
), "fp16 mode requires cudnn backend to be enabled."
signal.signal(signal.SIGTERM, handler)
if args.static_loss_scale != 1.0:
if not args.fp16:
print("Warning: if --fp16 is not used, static_loss_scale will be ignored.")
if not args.amp:
print("Warning: if --amp is not used, static_loss_scale will be ignored.")
if args.optimizer_batch_size < 0:
batch_size_multiplier = 1
@ -387,6 +378,11 @@ def main(args):
args.resume, checkpoint["epoch"]
)
)
if start_epoch >= args.epochs:
print(
f"Launched training for {args.epochs}, checkpoint already run {start_epoch}"
)
exit(1)
else:
print("=> no checkpoint found at '{}'".format(args.resume))
model_state = None
@ -410,7 +406,6 @@ def main(args):
loss,
pretrained_weights=pretrained_weights,
cuda=True,
fp16=args.fp16,
memory_format=memory_format,
)
@ -427,6 +422,9 @@ def main(args):
elif args.data_backend == "syntetic":
get_val_loader = get_syntetic_loader
get_train_loader = get_syntetic_loader
else:
print("Bad databackend picked")
exit(1)
train_loader, train_loader_len = get_train_loader(
args.data,
@ -435,7 +433,6 @@ def main(args):
args.mixup > 0.0,
start_epoch=start_epoch,
workers=args.workers,
fp16=args.fp16,
memory_format=memory_format,
)
if args.mixup != 0.0:
@ -447,7 +444,6 @@ def main(args):
args.num_classes,
False,
workers=args.workers,
fp16=args.fp16,
memory_format=memory_format,
)
@ -473,15 +469,12 @@ def main(args):
optimizer = get_optimizer(
list(model_and_loss.model.named_parameters()),
args.fp16,
args.lr,
args.momentum,
args.weight_decay,
nesterov=args.nesterov,
bn_weight_decay=args.bn_weight_decay,
state=optimizer_state,
static_loss_scale=args.static_loss_scale,
dynamic_loss_scale=args.dynamic_loss_scale,
)
if args.lr_schedule == "step":
@ -493,26 +486,26 @@ def main(args):
elif args.lr_schedule == "linear":
lr_policy = lr_linear_policy(args.lr, args.warmup, args.epochs, logger=logger)
if args.amp:
model_and_loss, optimizer = amp.initialize(
model_and_loss,
optimizer,
opt_level="O1",
loss_scale="dynamic" if args.dynamic_loss_scale else args.static_loss_scale,
)
scaler = torch.cuda.amp.GradScaler(
init_scale=args.static_loss_scale,
growth_factor=2,
backoff_factor=0.5,
growth_interval=100 if args.dynamic_loss_scale else 1000000000,
enabled=args.amp,
)
if args.distributed:
model_and_loss.distributed()
model_and_loss.distributed(args.gpu)
model_and_loss.load_model_state(model_state)
train_loop(
model_and_loss,
optimizer,
scaler,
lr_policy,
train_loader,
val_loader,
args.fp16,
logger,
should_backup_checkpoint(args),
use_amp=args.amp,

View file

@ -1,2 +1 @@
pytorch-ignite
git+git://github.com/NVIDIA/dllogger.git@26a0f8f1958de2c0c460925ff6102a4d2486d6cc#egg=dllogger

View file

@ -30,12 +30,12 @@ achieve state-of-the-art accuracy, and is tested and maintained by NVIDIA.
* [Inference performance benchmark](#inference-performance-benchmark)
* [Results](#results)
* [Training accuracy results](#training-accuracy-results)
* [Training accuracy: NVIDIA DGX A100 (8x A100 40GB)](#training-accuracy-nvidia-dgx-a100-8x-a100-40gb)
* [Training accuracy: NVIDIA DGX A100 (8x A100 80GB)](#training-accuracy-nvidia-dgx-a100-8x-a100-80gb)
* [Training accuracy: NVIDIA DGX-1 (8x V100 16GB)](#training-accuracy-nvidia-dgx-1-8x-v100-16gb)
* [Training accuracy: NVIDIA DGX-2 (16x V100 32GB)](#training-accuracy-nvidia-dgx-2-16x-v100-32gb)
* [Example plots](#example-plots)
* [Training performance results](#training-performance-results)
* [Training performance: NVIDIA DGX A100 (8x A100 40GB)](#training-performance-nvidia-dgx-a100-8x-a100-40gb)
* [Training performance: NVIDIA DGX A100 (8x A100 80GB)](#training-performance-nvidia-dgx-a100-8x-a100-80gb)
* [Training performance: NVIDIA DGX-1 16GB (8x V100 16GB)](#training-performance-nvidia-dgx-1-16gb-8x-v100-16gb)
* [Training performance: NVIDIA DGX-1 32GB (8x V100 32GB)](#training-performance-nvidia-dgx-1-32gb-8x-v100-32gb)
* [Inference performance results](#inference-performance-results)
@ -119,6 +119,8 @@ and this recipe keeps the original assumption that validation is done on 224px i
Using 288px images means that a lot more FLOPs are needed during inference to reach the same accuracy.
### Feature support matrix
The following features are supported by this model:
@ -204,7 +206,7 @@ The following section lists the requirements that you need to meet in order to s
This repository contains a Dockerfile that extends the PyTorch NGC container and encapsulates some dependencies. Aside from these dependencies, ensure you have the following components:
* [NVIDIA Docker](https://github.com/NVIDIA/nvidia-docker)
* [PyTorch 20.06-py3 NGC container](https://ngc.nvidia.com/registry/nvidia-pytorch) or newer
* [PyTorch 20.12-py3 NGC container](https://ngc.nvidia.com/registry/nvidia-pytorch) or newer
* Supported GPUs:
* [NVIDIA Volta architecture](https://www.nvidia.com/en-us/data-center/volta-gpu-architecture/)
* [NVIDIA Turing architecture](https://www.nvidia.com/en-us/geforce/turing/)
@ -256,28 +258,28 @@ For the specifics concerning training and inference, see the [Advanced](#advance
The directory in which the `train/` and `val/` directories are placed is referred to as `<path to imagenet>` in this document.
### 3. Build the RN50v1.5 PyTorch NGC container.
### 3. Build the ResNet50 PyTorch NGC container.
```
docker build . -t nvidia_rn50
docker build . -t nvidia_resnet50
```
### 4. Start an interactive session in the NGC container to run training/inference.
```
nvidia-docker run --rm -it -v <path to imagenet>:/data/imagenet --ipc=host nvidia_rn50
nvidia-docker run --rm -it -v <path to imagenet>:/imagenet --ipc=host nvidia_resnet50
```
### 5. Start training
To run training for a standard configuration (DGXA100/DGX1/DGX2, AMP/TF32/FP32, 50/90/250 Epochs),
To run training for a standard configuration (DGXA100/DGX1V/DGX2V, AMP/TF32/FP32, 90/250 Epochs),
run one of the scripts in the `./resnet50v1.5/training` directory
called `./resnet50v1.5/training/{AMP, TF32, FP32}/{DGXA100, DGX1, DGX2}_RN50_{AMP, TF32, FP32}_{50,90,250}E.sh`.
called `./resnet50v1.5/training/{AMP, TF32, FP32}/{ DGXA100, DGX1V, DGX2V }_resnet50_{AMP, TF32, FP32}_{ 90, 250 }E.sh`.
Ensure ImageNet is mounted in the `/data/imagenet` directory.
Ensure ImageNet is mounted in the `/imagenet` directory.
Example:
`bash ./resnet50v1.5/training/AMP/DGX1_RN50_AMP_250E.sh <path where to store checkpoints and logs>`
`bash ./resnet50v1.5/training/AMP/DGX1_resnet50_AMP_250E.sh <path where to store checkpoints and logs>`
### 6. Start inference
@ -295,7 +297,7 @@ To run inference on ImageNet, run:
To run inference on JPEG image using pretrained weights:
`python classify.py --arch resnet50 -c fanin --weights nvidia_resnet50_200821.pth.tar --precision AMP|FP32 --image <path to JPEG image>`
`python classify.py --arch resnet50 -c fanin --weights nvidia_resnet50_200821.pth.tar --precision AMP|FP32 --image <path to JPEG image>`
## Advanced
@ -334,7 +336,7 @@ usage: main.py [-h] [--data-backend BACKEND] [--arch ARCH]
[--lr-schedule SCHEDULE] [--warmup E] [--label-smoothing S]
[--mixup ALPHA] [--momentum M] [--weight-decay W]
[--bn-weight-decay] [--nesterov] [--print-freq N]
[--resume PATH] [--pretrained-weights PATH] [--fp16]
[--resume PATH] [--pretrained-weights PATH]
[--static-loss-scale STATIC_LOSS_SCALE] [--dynamic-loss-scale]
[--prof N] [--amp] [--seed SEED] [--gather-checkpoints]
[--raport-file RAPORT_FILE] [--evaluate] [--training-only]
@ -353,8 +355,10 @@ optional arguments:
data backend: pytorch | syntetic | dali-gpu | dali-cpu
(default: dali-cpu)
--arch ARCH, -a ARCH model architecture: resnet18 | resnet34 | resnet50 |
resnet101 | resnet152 | resnext101-32x4d | se-
resnext101-32x4d (default: resnet50)
resnet101 | resnet152 | resnext50-32x4d |
resnext101-32x4d | resnext101-32x8d |
resnext101-32x8d-basic | se-resnext101-32x4d (default:
resnet50)
--model-config CONF, -c CONF
model configs: classic | fanin | grp-fanin | grp-
fanout(default: classic)
@ -383,10 +387,9 @@ optional arguments:
--resume PATH path to latest checkpoint (default: none)
--pretrained-weights PATH
load weights from here
--fp16 Run model fp16 mode.
--static-loss-scale STATIC_LOSS_SCALE
Static loss scale, positive power of 2 values can
improve fp16 convergence.
improve amp convergence.
--dynamic-loss-scale Use dynamic loss scaling. If supplied, this argument
supersedes --static-loss-scale.
--prof N Run only N iterations
@ -404,6 +407,7 @@ optional arguments:
--workspace DIR path to directory where checkpoints will be stored
--memory-format {nchw,nhwc}
memory layout, nchw or nhwc
```
To use your own dataset, divide it into directories as in the following scheme:
- Training images - `train/<class id>/<image>`
- Validation images - `val/<class id>/<image>`
If your dataset has a number of classes different than 1000, you need to add a custom config
in the `image_classification/resnet.py` file.
```python
resnet_versions = {
...
'resnet50-custom' : {
'net' : ResNet,
'block' : Bottleneck,
'layers' : [3, 4, 6, 3],
'widths' : [64, 128, 256, 512],
'expansion' : 4,
'num_classes' : <custom number of classes>,
}
}
```
After adding the config, run the training script with `--arch resnet50-custom` flag.
If your dataset has a number of classes different than 1000, you need to pass the `--num-classes N` flag to the training script.
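For example, assuming a 10-class dataset arranged as above (the batch size and epoch count here are only illustrative):
`python ./main.py --arch resnet50 --num-classes 10 -b 64 --epochs 90 <path to dataset>`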
### Training process
@ -454,7 +441,7 @@ To restart training from checkpoint use `--resume` option.
To start training from pretrained weights (e.g. downloaded from NGC) use `--pretrained-weights` option.
The difference between those two is that the pretrained weights contain only model weights,
and checkpoints, apart from model weights, contain optimizer state, LR scheduler state, RNG state.
and checkpoints, apart from model weights, contain optimizer state, LR scheduler state.
Checkpoints are suitable for dividing the training into parts, for example in order
to divide the training job into shorter stages, or to restart training after an infrastructure failure.
@ -500,14 +487,13 @@ wget --content-disposition https://api.ngc.nvidia.com/v2/models/nvidia/resnet50_
unzip resnet50_pyt_amp_20.06.0.zip
```
To run inference on ImageNet, run:
`python ./main.py --arch resnet50 --evaluate --epochs 1 --pretrained-weights nvidia_resnet50_200821.pth.tar -b <batch size> <path to imagenet>`
To run inference on JPEG image using pretrained weights:
`python classify.py --arch resnet50 -c fanin --weights nvidia_resnet50_200821.pth.tar --precision AMP|FP32 --image <path to JPEG image>`
`python classify.py --arch resnet50 --weights nvidia_resnet50_200821.pth.tar --precision AMP|FP32 --image <path to JPEG image>`
## Performance
@ -521,72 +507,63 @@ The following section shows how to run benchmarks measuring the model performanc
To benchmark training, run:
* For 1 GPU
* FP32
`python ./main.py --arch resnet50 -b <batch_size> --training-only -p 1 --raport-file benchmark.json --epochs 1 --prof 100 <path to imagenet>`
* FP32 (V100 GPUs only)
`python ./launch.py --model resnet50 --precision FP32 --mode benchmark_training --platform DGX1V <path to imagenet> --raport-file benchmark.json --epochs 1 --prof 100`
* TF32 (A100 GPUs only)
`python ./launch.py --model resnet50 --precision TF32 --mode benchmark_training --platform DGXA100 <path to imagenet> --raport-file benchmark.json --epochs 1 --prof 100`
* AMP
`python ./main.py --arch resnet50 -b <batch_size> --training-only -p 1 --raport-file benchmark.json --epochs 1 --prof 100 --amp --static-loss-scale 256 <path to imagenet>`
`python ./launch.py --model resnet50 --precision AMP --mode benchmark_training --platform <DGX1V|DGXA100> <path to imagenet> --raport-file benchmark.json --epochs 1 --prof 100`
* For multiple GPUs
* FP32
`python ./multiproc.py --nproc_per_node 8 ./main.py --arch resnet50 -b <batch_size> --training-only -p 1 --raport-file benchmark.json --epochs 1 --prof 100 <path to imagenet>`
* FP32 (V100 GPUs only)
`python ./launch.py --model resnet50 --precision FP32 --mode benchmark_training --platform DGX1V <path to imagenet> --raport-file benchmark.json --epochs 1 --prof 100`
* TF32 (A100 GPUs only)
`python ./multiproc.py --nproc_per_node 8 ./launch.py --model resnet50 --precision TF32 --mode benchmark_training --platform DGXA100 <path to imagenet> --raport-file benchmark.json --epochs 1 --prof 100`
* AMP
`python ./multiproc.py --nproc_per_node 8 ./main.py --arch resnet50 -b <batch_size> --training-only -p 1 --raport-file benchmark.json --amp --static-loss-scale 256 --epochs 1 --prof 100 <path to imagenet>`
`python ./multiproc.py --nproc_per_node 8 ./launch.py --model resnet50 --precision AMP --mode benchmark_training --platform <DGX1V|DGXA100> <path to imagenet> --raport-file benchmark.json --epochs 1 --prof 100`
Each of these scripts will run 100 iterations and save results in the `benchmark.json` file.
Batch size should be picked appropriately depending on the hardware configuration.
| *Platform* | *Precision* | *Batch Size* |
|:----------:|:-----------:|:------------:|
| DGXA100 | AMP | 256 |
| DGXA100 | TF32 | 256 |
| DGX-1 | AMP | 256 |
| DGX-1 | FP32 | 128 |
#### Inference performance benchmark
To benchmark inference, run:
* FP32
* FP32 (V100 GPUs only)
`python ./main.py --arch resnet50 -p 1 --raport-file benchmark.json --epochs 1 --prof 100 --evaluate <path to imagenet>`
`python ./launch.py --model resnet50 --precision FP32 --mode benchmark_inference --platform DGX1V <path to imagenet> --raport-file benchmark.json --epochs 1 --prof 100`
* TF32 (A100 GPUs only)
`python ./launch.py --model resnet50 --precision FP32 --mode benchmark_inference --platform DGXA100 <path to imagenet> --raport-file benchmark.json --epochs 1 --prof 100`
* AMP
`python ./main.py --arch resnet50 -p 1 --raport-file benchmark.json --epochs 1 --prof 100 --evaluate --amp <path to imagenet>`
`python ./launch.py --model resnet50 --precision AMP --mode benchmark_inference --platform <DGX1V|DGXA100> <path to imagenet> --raport-file benchmark.json --epochs 1 --prof 100`
Each of these scripts will run 100 iterations and save results in the `benchmark.json` file.
Batch size should be picked appropriately depending on the hardware configuration.
| *Platform* | *Precision* | *Batch Size* |
|:----------:|:-----------:|:------------:|
| DGXA100 | AMP | 256 |
| DGXA100 | TF32 | 256 |
| DGX-1 | AMP | 256 |
| DGX-1 | FP32 | 128 |
### Results
Our results were obtained by running the applicable training script in the pytorch-20.06 NGC container.
Our results were obtained by running the applicable training script in the pytorch-20.12 NGC container.
To achieve these same results, follow the steps in the [Quick Start Guide](#quick-start-guide).
#### Training accuracy results
##### Training accuracy: NVIDIA DGX A100 (8x A100 40GB)
##### Training accuracy: NVIDIA DGX A100 (8x A100 80GB)
| **epochs** | **Mixed Precision Top1** | **TF32 Top1** |
|:------:|:--------------------:|:--------------:|
| 90 | 76.93 +/- 0.23 | 76.85 +/- 0.30 |
| **Epochs** | **Mixed Precision Top1** | **TF32 Top1** |
|:----------:|:------------------------:|:--------------:|
| 90 | 77.12 +/- 0.11 | 76.95 +/- 0.18 |
| 250 | 78.43 +/- 0.11 | 78.38 +/- 0.17 |
##### Training accuracy: NVIDIA DGX-1 (8x V100 16GB)
| **epochs** | **Mixed Precision Top1** | **FP32 Top1** |
|:-:|:-:|:-:|
| 50 | 76.25 +/- 0.04 | 76.26 +/- 0.07 |
| 90 | 77.09 +/- 0.10 | 77.01 +/- 0.16 |
| 250 | 78.42 +/- 0.04 | 78.30 +/- 0.16 |
| **Epochs** | **Mixed Precision Top1** | **FP32 Top1** |
|:----------:|:------------------------:|:--------------:|
| 90 | 76.88 +/- 0.16 | 77.01 +/- 0.16 |
| 250 | 78.25 +/- 0.12 | 78.30 +/- 0.16 |
##### Training accuracy: NVIDIA DGX-2 (16x V100 32GB)
@ -610,26 +587,28 @@ The following images show a 250 epochs configuration on a DGX-1V.
#### Training performance results
##### Training performance: NVIDIA DGX A100 (8x A100 40GB)
##### Training performance: NVIDIA DGX A100 (8x A100 80GB)
| **GPUs** | **Mixed Precision** | **TF32** | **Mixed Precision Speedup** | **Mixed Precision Strong Scaling** | **Mixed Precision Training Time (90E)** | **TF32 Strong Scaling** | **TF32 Training Time (90E)** |
|:--------:|:-------------------:|:----------:|:---------------------------:|:----------------------------------:|:---------------------------------------:|:-----------------------:|:----------------------------:|
| 1 | 2461 img/s | 945 img/s | 2.6 x | 1.0 x | ~14 hours | 1.0 x | ~36 hours |
| 8 | 15977 img/s | 7365 img/s | 2.16 x | 6.49 x | ~3 hours | 7.78 x | ~5 hours |
|**GPUs**|**Mixed Precision**| **TF32** |**Mixed Precision Speedup**|**Mixed Precision Strong Scaling**|**Mixed Precision Training Time (90E)**|**TF32 Strong Scaling**|**TF32 Training Time (90E)**|
|:------:|:-----------------:|:-----------:|:-------------------------:|:--------------------------------:|:-------------------------------------:|:---------------------:|:--------------------------:|
| 1 | 1240.81 img/s |680.15 img/s | 1.82x | 1.00x | ~27 hours | 1.00x | ~49 hours |
| 8 | 9604.92 img/s |5379.82 img/s| 1.79x | 7.74x | ~4 hours | 7.91x | ~6 hours |
##### Training performance: NVIDIA DGX-1 16GB (8x V100 16GB)
|**GPUs**|**Mixed Precision**| **FP32** |**Mixed Precision Speedup**|**Mixed Precision Strong Scaling**|**Mixed Precision Training Time (90E)**|**FP32 Strong Scaling**|**FP32 Training Time (90E)**|
|:------:|:-----------------:|:-----------:|:-------------------------:|:--------------------------------:|:-------------------------------------:|:---------------------:|:--------------------------:|
| 1 | 856.52 img/s |373.21 img/s | 2.30x | 1.00x | ~39 hours | 1.00x | ~89 hours |
| 8 | 6635.90 img/s |2899.62 img/s| 2.29x | 7.75x | ~5 hours | 7.77x | ~12 hours |
| **GPUs** | **Mixed Precision** | **FP32** | **Mixed Precision Speedup** | **Mixed Precision Strong Scaling** | **Mixed Precision Training Time (90E)** | **FP32 Strong Scaling** | **FP32 Training Time (90E)** |
|:--------:|:-------------------:|:----------:|:---------------------------:|:----------------------------------:|:---------------------------------------:|:-----------------------:|:----------------------------:|
| 1 | 1180 img/s | 371 img/s | 3.17 x | 1.0 x | ~29 hours | 1.0 x | ~91 hours |
| 8 | 7608 img/s | 2851 img/s | 2.66 x | 6.44 x | ~5 hours | 7.66 x | ~12 hours |
##### Training performance: NVIDIA DGX-1 32GB (8x V100 32GB)
|**GPUs**|**Mixed Precision**| **FP32** |**Mixed Precision Speedup**|**Mixed Precision Strong Scaling**|**Mixed Precision Training Time (90E)**|**FP32 Strong Scaling**|**FP32 Training Time (90E)**|
|:------:|:-----------------:|:-----------:|:-------------------------:|:--------------------------------:|:-------------------------------------:|:---------------------:|:--------------------------:|
| 1 | 816.00 img/s |359.76 img/s | 2.27x | 1.00x | ~41 hours | 1.00x | ~93 hours |
| 8 | 6347.26 img/s |2813.23 img/s| 2.26x | 7.78x | ~5 hours | 7.82x | ~12 hours |
| **GPUs** | **Mixed Precision** | **FP32** | **Mixed Precision Speedup** | **Mixed Precision Strong Scaling** | **Mixed Precision Training Time (90E)** | **FP32 Strong Scaling** | **FP32 Training Time (90E)** |
|:--------:|:-------------------:|:----------:|:---------------------------:|:----------------------------------:|:---------------------------------------:|:-----------------------:|:----------------------------:|
| 1 | 1115 img/s | 365 img/s | 3.04 x | 1.0 x | ~31 hours | 1.0 x | ~92 hours |
| 8 | 7375 img/s | 2811 img/s | 2.62 x | 6.61 x | ~5 hours | 7.68 x | ~12 hours |
#### Inference performance results
@ -638,66 +617,66 @@ The following images show a 250 epochs configuration on a DGX-1V.
###### FP32 Inference Latency
| **batch size** | **Throughput Avg** | **Latency Avg** | **Latency 90%** | **Latency 95%** | **Latency 99%** |
|:-:|:-:|:-:|:-:|:-:|:-:|
| 1 | 136.82 img/s | 7.12ms | 7.25ms | 8.36ms | 10.92ms |
| 2 | 266.86 img/s | 7.27ms | 7.41ms | 7.85ms | 9.11ms |
| 4 | 521.76 img/s | 7.44ms | 7.58ms | 8.14ms | 10.09ms |
| 8 | 766.22 img/s | 10.18ms | 10.46ms | 10.97ms | 12.75ms |
| 16 | 976.36 img/s | 15.79ms | 15.88ms | 15.95ms | 16.63ms |
| 32 | 1092.27 img/s | 28.63ms | 28.71ms | 28.76ms | 29.30ms |
| 64 | 1161.55 img/s | 53.69ms | 53.86ms | 53.90ms | 54.23ms |
| 128 | 1209.12 img/s | 104.24ms | 104.68ms | 104.80ms | 105.00ms |
| 256 | N/A | N/A | N/A | N/A | N/A |
| **Batch Size** | **Throughput Avg** | **Latency Avg** | **Latency 95%** | **Latency 99%** |
|:--------------:|:------------------:|:---------------:|:---------------:|:---------------:|
| 1 | 99 img/s | 10.38 ms | 11.24 ms | 12.32 ms |
| 2 | 190 img/s | 10.87 ms | 12.18 ms | 14.27 ms |
| 4 | 403 img/s | 10.26 ms | 11.02 ms | 13.28 ms |
| 8 | 754 img/s | 10.96 ms | 11.99 ms | 13.89 ms |
| 16 | 960 img/s | 17.16 ms | 16.74 ms | 18.18 ms |
| 32 | 1057 img/s | 31.39 ms | 30.4 ms | 30.55 ms |
| 64 | 1168 img/s | 57.1 ms | 55.01 ms | 56.19 ms |
| 112 | 1166 img/s | 100.78 ms | 95.98 ms | 97.43 ms |
| 128 | 1215 img/s | 111.11 ms | 105.52 ms | 106.38 ms |
| 256 | 1253 img/s | 217.03 ms | 203.78 ms | 208.68 ms |
###### Mixed Precision Inference Latency
| **batch size** | **Throughput Avg** | **Latency Avg** | **Latency 90%** | **Latency 95%** | **Latency 99%** |
|:-:|:-:|:-:|:-:|:-:|:-:|
| 1 | 114.97 img/s | 8.56ms | 9.32ms | 11.43ms | 12.79ms |
| 2 | 238.70 img/s | 8.20ms | 8.75ms | 9.49ms | 12.31ms |
| 4 | 448.69 img/s | 8.67ms | 9.20ms | 9.97ms | 10.60ms |
| 8 | 875.00 img/s | 8.88ms | 9.31ms | 9.80ms | 10.82ms |
| 16 | 1746.07 img/s | 8.89ms | 9.05ms | 9.56ms | 12.81ms |
| 32 | 2004.28 img/s | 14.07ms | 14.14ms | 14.31ms | 14.92ms |
| 64 | 2254.60 img/s | 25.93ms | 26.05ms | 26.07ms | 26.17ms |
| 128 | 2360.14 img/s | 50.14ms | 50.28ms | 50.34ms | 50.68ms |
| 256 | 2342.13 img/s | 96.74ms | 96.91ms | 96.99ms | 97.14ms |
| **Batch Size** | **Throughput Avg** | **Latency Avg** | **Latency 95%** | **Latency 99%** |
|:--------------:|:------------------:|:---------------:|:---------------:|:---------------:|
| 1 | 82 img/s | 12.43 ms | 13.29 ms | 14.89 ms |
| 2 | 157 img/s | 13.04 ms | 13.84 ms | 16.79 ms |
| 4 | 310 img/s | 13.26 ms | 14.42 ms | 15.63 ms |
| 8 | 646 img/s | 12.69 ms | 13.65 ms | 15.48 ms |
| 16 | 1188 img/s | 14.01 ms | 15.56 ms | 18.34 ms |
| 32 | 2093 img/s | 16.41 ms | 18.25 ms | 19.9 ms |
| 64 | 2899 img/s | 24.12 ms | 22.14 ms | 22.55 ms |
| 128 | 3142 img/s | 45.28 ms | 40.77 ms | 42.89 ms |
| 256 | 3276 img/s | 88.44 ms | 77.8 ms | 79.01 ms |
| 256 | 3276 img/s | 88.6 ms | 77.74 ms | 79.11 ms |
##### Inference performance: NVIDIA T4
###### FP32 Inference Latency
| **batch size** | **Throughput Avg** | **Latency Avg** | **Latency 90%** | **Latency 95%** | **Latency 99%** |
|:-:|:-:|:-:|:-:|:-:|:-:|
| 1 | 179.85 img/s | 5.51ms | 5.65ms | 7.34ms | 10.97ms |
| 2 | 348.12 img/s | 5.67ms | 5.95ms | 6.33ms | 9.81ms |
| 4 | 556.27 img/s | 7.03ms | 7.34ms | 8.13ms | 9.65ms |
| 8 | 740.43 img/s | 10.32ms | 10.33ms | 10.60ms | 13.87ms |
| 16 | 909.17 img/s | 17.19ms | 17.15ms | 18.13ms | 21.06ms |
| 32 | 999.07 img/s | 31.07ms | 31.12ms | 31.17ms | 32.41ms |
| 64 | 1090.47 img/s | 57.62ms | 57.84ms | 57.91ms | 58.05ms |
| 128 | 1142.46 img/s | 110.94ms | 111.15ms | 111.23ms | 112.16ms |
| 256 | N/A | N/A | N/A | N/A | N/A |
| **Batch Size** | **Throughput Avg** | **Latency Avg** | **Latency 95%** | **Latency 99%** |
|:--------------:|:------------------:|:---------------:|:---------------:|:---------------:|
| 1 | 147 img/s | 7.28 ms | 8.48 ms | 9.79 ms |
| 2 | 251 img/s | 8.48 ms | 10.23 ms | 14.01 ms |
| 4 | 303 img/s | 13.57 ms | 13.61 ms | 15.42 ms |
| 8 | 329 img/s | 24.7 ms | 24.74 ms | 25.0 ms |
| 16 | 371 img/s | 43.73 ms | 43.74 ms | 44.03 ms |
| 32 | 395 img/s | 82.36 ms | 82.13 ms | 82.58 ms |
| 64 | 421 img/s | 155.37 ms | 153.07 ms | 153.55 ms |
| 128 | 426 img/s | 309.06 ms | 303.0 ms | 307.42 ms |
| 256 | 419 img/s | 631.43 ms | 612.42 ms | 614.82 ms |
###### Mixed Precision Inference Latency
| **batch size** | **Throughput Avg** | **Latency Avg** | **Latency 90%** | **Latency 95%** | **Latency 99%** |
|:-:|:-:|:-:|:-:|:-:|:-:|
| 1 | 163.78 img/s | 6.05ms | 5.92ms | 7.98ms | 11.58ms |
| 2 | 333.43 img/s | 5.91ms | 6.05ms | 6.63ms | 11.52ms |
| 4 | 645.45 img/s | 6.04ms | 6.33ms | 7.01ms | 8.90ms |
| 8 | 1164.15 img/s | 6.73ms | 7.31ms | 8.04ms | 12.41ms |
| 16 | 1606.42 img/s | 9.53ms | 9.86ms | 10.52ms | 17.01ms |
| 32 | 1857.29 img/s | 15.67ms | 15.61ms | 16.14ms | 18.66ms |
| 64 | 2011.62 img/s | 28.64ms | 28.69ms | 28.82ms | 31.06ms |
| 128 | 2083.90 img/s | 54.87ms | 54.96ms | 54.99ms | 55.27ms |
| 256 | 2043.72 img/s | 106.51ms | 106.62ms | 106.68ms | 107.03ms |
| **Batch Size** | **Throughput Avg** | **Latency Avg** | **Latency 95%** | **Latency 99%** |
|:--------------:|:------------------:|:---------------:|:---------------:|:---------------:|
| 1 | 112 img/s | 9.25 ms | 9.87 ms | 10.62 ms |
| 2 | 223 img/s | 9.4 ms | 10.62 ms | 13.9 ms |
| 4 | 468 img/s | 9.06 ms | 11.15 ms | 15.5 ms |
| 8 | 844 img/s | 10.05 ms | 12.67 ms | 17.86 ms |
| 16 | 1037 img/s | 16.01 ms | 15.66 ms | 15.86 ms |
| 32 | 1103 img/s | 30.27 ms | 29.45 ms | 29.74 ms |
| 64 | 1154 img/s | 57.96 ms | 56.33 ms | 56.96 ms |
| 128 | 1177 img/s | 114.95 ms | 110.4 ms | 111.1 ms |
| 256 | 1184 img/s | 229.61 ms | 217.84 ms | 224.75 ms |
## Release notes
@ -720,9 +699,9 @@ The following images show a 250 epochs configuration on a DGX-1V.
5. July 2020
* Added A100 scripts
* Updated README
6. February 2021
* Moved from APEX AMP to Native AMP
### Known issues
There are no known issues with this model.

View file

@ -0,0 +1 @@
python ./launch.py --model resnet50 --precision AMP --mode convergence --platform DGX1V /imagenet --workspace ${1:-./} --raport-file raport.json

View file

@ -0,0 +1 @@
python ./launch.py --model resnet50 --precision AMP --mode convergence --platform DGX1V /imagenet --epochs 90 --mixup 0.0 --workspace ${1:-./} --raport-file raport.json

View file

@ -1 +0,0 @@
python ./multiproc.py --nproc_per_node 8 ./main.py /imagenet --data-backend dali-cpu --raport-file raport.json -j8 -p 100 --lr 2.048 --optimizer-batch-size 2048 --warmup 8 --arch resnet50 -c fanin --label-smoothing 0.1 --lr-schedule cosine --mom 0.875 --wd 3.0517578125e-05 --workspace ${1:-./} -b 256 --amp --static-loss-scale 128 --epochs 250 --mixup 0.2

View file

@ -1 +0,0 @@
python ./multiproc.py --nproc_per_node 8 ./main.py /imagenet --data-backend dali-cpu --raport-file raport.json -j8 -p 100 --lr 2.048 --optimizer-batch-size 2048 --warmup 8 --arch resnet50 -c fanin --label-smoothing 0.1 --lr-schedule cosine --mom 0.875 --wd 3.0517578125e-05 --workspace ${1:-./} -b 256 --amp --static-loss-scale 128 --epochs 50

View file

@ -1 +0,0 @@
python ./multiproc.py --nproc_per_node 8 ./main.py /imagenet --data-backend dali-cpu --raport-file raport.json -j8 -p 100 --lr 2.048 --optimizer-batch-size 2048 --warmup 8 --arch resnet50 -c fanin --label-smoothing 0.1 --lr-schedule cosine --mom 0.875 --wd 3.0517578125e-05 --workspace ${1:-./} -b 256 --amp --static-loss-scale 128 --epochs 90

View file

@ -0,0 +1 @@
python ./launch.py --model resnet50 --precision AMP --mode convergence --platform DGX2V /imagenet --workspace ${1:-./} --raport-file raport.json

View file

@ -0,0 +1 @@
python ./launch.py --model resnet50 --precision AMP --mode convergence --platform DGX2V /imagenet --epochs 90 --mixup 0.0 --workspace ${1:-./} --raport-file raport.json

View file

@ -1 +0,0 @@
python ./multiproc.py --nproc_per_node 16 ./main.py /imagenet --data-backend dali-gpu --raport-file raport.json -j8 -p 100 --lr 4.096 --optimizer-batch-size 4096 --warmup 16 --arch resnet50 -c fanin --label-smoothing 0.1 --lr-schedule cosine --mom 0.875 --wd 3.0517578125e-05 --workspace ${1:-./} -b 256 --amp --static-loss-scale 128 --epochs 250 --mixup 0.2

View file

@ -1 +0,0 @@
python ./multiproc.py --nproc_per_node 16 ./main.py /imagenet --data-backend dali-gpu --raport-file raport.json -j8 -p 100 --lr 4.096 --optimizer-batch-size 4096 --warmup 16 --arch resnet50 -c fanin --label-smoothing 0.1 --lr-schedule cosine --mom 0.875 --wd 3.0517578125e-05 --workspace ${1:-./} -b 256 --amp --static-loss-scale 128 --epochs 50

View file

@ -1 +0,0 @@
python ./multiproc.py --nproc_per_node 16 ./main.py /imagenet --data-backend dali-gpu --raport-file raport.json -j8 -p 100 --lr 4.096 --optimizer-batch-size 4096 --warmup 16 --arch resnet50 -c fanin --label-smoothing 0.1 --lr-schedule cosine --mom 0.875 --wd 3.0517578125e-05 --workspace ${1:-./} -b 256 --amp --static-loss-scale 128 --epochs 90

View file

@ -1 +0,0 @@
python ./multiproc.py --nproc_per_node 8 ./main.py /imagenet --data-backend dali-cpu --raport-file raport.json -j16 -p 100 --lr 2.048 --optimizer-batch-size 2048 --warmup 8 --arch resnet50 -c fanin --label-smoothing 0.1 --lr-schedule cosine --mom 0.875 --wd 3.0517578125e-05 --workspace ${1:-./} -b 256 --amp --static-loss-scale 128 --epochs 90

View file

@ -0,0 +1 @@
python ./launch.py --model resnet50 --precision AMP --mode convergence --platform DGXA100 /imagenet --workspace ${1:-./} --raport-file raport.json

View file

@ -0,0 +1 @@
python ./launch.py --model resnet50 --precision AMP --mode convergence --platform DGXA100 /imagenet --epochs 90 --mixup 0.0 --workspace ${1:-./} --raport-file raport.json

View file

@ -0,0 +1 @@
python ./launch.py --model resnet50 --precision FP32 --mode convergence --platform DGX1V /imagenet --workspace ${1:-./} --raport-file raport.json

View file

@ -0,0 +1 @@
python ./launch.py --model resnet50 --precision FP32 --mode convergence --platform DGX1V /imagenet --epochs 90 --mixup 0.0 --workspace ${1:-./} --raport-file raport.json

View file

@ -1 +0,0 @@
python ./multiproc.py --nproc_per_node 8 ./main.py /imagenet --data-backend dali-cpu --raport-file raport.json -j8 -p 100 --lr 2.048 --optimizer-batch-size 2048 --warmup 8 --arch resnet50 -c fanin --label-smoothing 0.1 --lr-schedule cosine --mom 0.875 --wd 3.0517578125e-05 --workspace ${1:-./} -b 128 --epochs 250 --mixup 0.2

View file

@ -1 +0,0 @@
python ./multiproc.py --nproc_per_node 8 ./main.py /imagenet --data-backend dali-cpu --raport-file raport.json -j8 -p 100 --lr 2.048 --optimizer-batch-size 2048 --warmup 8 --arch resnet50 -c fanin --label-smoothing 0.1 --lr-schedule cosine --mom 0.875 --wd 3.0517578125e-05 --workspace ${1:-./} -b 128 --epochs 50

View file

@ -1 +0,0 @@
python ./multiproc.py --nproc_per_node 8 ./main.py /imagenet --data-backend dali-cpu --raport-file raport.json -j8 -p 100 --lr 2.048 --optimizer-batch-size 2048 --warmup 8 --arch resnet50 -c fanin --label-smoothing 0.1 --lr-schedule cosine --mom 0.875 --wd 3.0517578125e-05 --workspace ${1:-./} -b 128 --epochs 90

View file

@ -0,0 +1 @@
python ./launch.py --model resnet50 --precision FP32 --mode convergence --platform DGX2V /imagenet --workspace ${1:-./} --raport-file raport.json

View file

@ -0,0 +1 @@
python ./launch.py --model resnet50 --precision FP32 --mode convergence --platform DGX2V /imagenet --epochs 90 --mixup 0.0 --workspace ${1:-./} --raport-file raport.json

View file

@ -1 +0,0 @@
python ./multiproc.py --nproc_per_node 16 ./main.py /imagenet --data-backend dali-gpu --raport-file raport.json -j8 -p 100 --lr 4.096 --optimizer-batch-size 4096 --warmup 16 --arch resnet50 -c fanin --label-smoothing 0.1 --lr-schedule cosine --mom 0.875 --wd 3.0517578125e-05 --workspace ${1:-./} -b 128 --epochs 250 --mixup 0.2

View file

@ -1 +0,0 @@
python ./multiproc.py --nproc_per_node 16 ./main.py /imagenet --data-backend dali-gpu --raport-file raport.json -j8 -p 100 --lr 4.096 --optimizer-batch-size 4096 --warmup 16 --arch resnet50 -c fanin --label-smoothing 0.1 --lr-schedule cosine --mom 0.875 --wd 3.0517578125e-05 --workspace ${1:-./} -b 128 --epochs 50

View file

@ -1 +0,0 @@
python ./multiproc.py --nproc_per_node 16 ./main.py /imagenet --data-backend dali-gpu --raport-file raport.json -j8 -p 100 --lr 4.096 --optimizer-batch-size 4096 --warmup 16 --arch resnet50 -c fanin --label-smoothing 0.1 --lr-schedule cosine --mom 0.875 --wd 3.0517578125e-05 --workspace ${1:-./} -b 128 --epochs 90

View file

@ -1 +0,0 @@
python ./multiproc.py --nproc_per_node 8 ./main.py /imagenet --data-backend dali-cpu --raport-file raport.json -j16 -p 100 --lr 2.048 --optimizer-batch-size 2048 --warmup 8 --arch resnet50 -c fanin --label-smoothing 0.1 --lr-schedule cosine --mom 0.875 --wd 3.0517578125e-05 --workspace ${1:-./} -b 256 --epochs 90

View file

@ -0,0 +1 @@
python ./launch.py --model resnet50 --precision TF32 --mode convergence --platform DGXA100 /imagenet --workspace ${1:-./} --raport-file raport.json

View file

@ -0,0 +1 @@
python ./launch.py --model resnet50 --precision TF32 --mode convergence --platform DGXA100 /imagenet --epochs 90 --mixup 0.0 --workspace ${1:-./} --raport-file raport.json

View file

@ -31,11 +31,11 @@ achieve state-of-the-art accuracy, and is tested and maintained by NVIDIA.
* [Inference performance benchmark](#inference-performance-benchmark)
* [Results](#results)
* [Training accuracy results](#training-accuracy-results)
* [Training accuracy: NVIDIA DGX A100 (8x A100 40GB)](#training-accuracy-nvidia-dgx-a100-8x-a100-40gb)
* [Training accuracy: NVIDIA DGX A100 (8x A100 80GB)](#training-accuracy-nvidia-dgx-a100-8x-a100-80gb)
* [Training accuracy: NVIDIA DGX-1 (8x V100 16GB)](#training-accuracy-nvidia-dgx-1-8x-v100-16gb)
* [Example plots](#example-plots)
* [Training performance results](#training-performance-results)
* [Training performance: NVIDIA DGX A100 (8x A100 40GB)](#training-performance-nvidia-dgx-a100-8x-a100-40gb)
* [Training performance: NVIDIA DGX A100 (8x A100 80GB)](#training-performance-nvidia-dgx-a100-8x-a100-80gb)
* [Training performance: NVIDIA DGX-1 16GB (8x V100 16GB)](#training-performance-nvidia-dgx-1-16gb-8x-v100-16gb)
* [Training performance: NVIDIA DGX-1 32GB (8x V100 32GB)](#training-performance-nvidia-dgx-1-32gb-8x-v100-32gb)
* [Inference performance results](#inference-performance-results)
@ -111,7 +111,7 @@ The following features are supported by this model:
| Feature | ResNeXt101-32x4d
|-----------------------|--------------------------
|[DALI](https://docs.nvidia.com/deeplearning/dali/release-notes/index.html) | Yes
|[DALI](https://docs.nvidia.com/deeplearning/sdk/dali-release-notes/index.html) | Yes
|[APEX AMP](https://nvidia.github.io/apex/amp.html) | Yes |
#### Features
@ -128,11 +128,11 @@ which speeds up data loading when CPU becomes a bottleneck.
DALI can use CPU or GPU, and outperforms the PyTorch native dataloader.
Run training with `--data-backend dali-gpu` or `--data-backend dali-cpu` to enable DALI.
For ResNeXt101-32x4d, for DGXA100, DGX1 and DGX2 we recommend `--data-backends dali-cpu`.
For DGXA100 and DGX1 we recommend `--data-backend dali-cpu`.
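For illustration only (batch size and the remaining hyperparameters are placeholders), a single-node run using the DALI CPU pipeline could look like `python ./multiproc.py --nproc_per_node 8 ./main.py /imagenet --data-backend dali-cpu --arch resnext101-32x4d -b 128 --amp ...`.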
### Mixed precision training
Mixed precision is the combined use of different numerical precisions in a computational method. [Mixed precision](https://arxiv.org/abs/1710.03740) training offers significant computational speedup by performing operations in half-precision format, while storing minimal information in single-precision to retain as much information as possible in critical parts of the network. Since the introduction of [Tensor Cores](https://developer.nvidia.com/tensor-cores) in the Volta and Turing architecture, significant training speedups are experienced by switching to mixed precision -- up to 3x overall speedup on the most arithmetically intense model architectures. Using mixed precision training requires two steps:
Mixed precision is the combined use of different numerical precisions in a computational method. [Mixed precision](https://arxiv.org/abs/1710.03740) training offers significant computational speedup by performing operations in half-precision format, while storing minimal information in single-precision to retain as much information as possible in critical parts of the network. Since the introduction of [Tensor Cores](https://developer.nvidia.com/tensor-cores) in Volta, and following with both the Turing and Ampere architectures, significant training speedups are experienced by switching to mixed precision -- up to 3x overall speedup on the most arithmetically intense model architectures. Using mixed precision training requires two steps:
1. Porting the model to use the FP16 data type where appropriate.
2. Adding loss scaling to preserve small gradient values.
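Both steps are handled by PyTorch native AMP (`torch.cuda.amp`), which these models now use instead of APEX AMP (see the changelog). The following is a minimal, generic sketch with a placeholder model and random data, not the training loop from this repository:

```python
import torch
import torch.nn as nn

# Minimal, self-contained sketch of the two steps above using native PyTorch AMP
# (torch.cuda.amp). The tiny model and random data are placeholders, not the
# models or data pipeline from this repository.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 1000)).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = torch.cuda.amp.GradScaler()           # loss scaling (step 2)

for _ in range(10):
    images = torch.randn(8, 3, 224, 224, device="cuda")
    targets = torch.randint(0, 1000, (8,), device="cuda")
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():            # forward pass in mixed precision (step 1)
        loss = nn.functional.cross_entropy(model(images), targets)
    scaler.scale(loss).backward()              # scale the loss before backpropagation
    scaler.step(optimizer)                     # unscale gradients, then update weights
    scaler.update()                            # adapt the loss scale for the next step
```

In this repository the same mechanism is enabled with the `--amp` flag, optionally together with `--static-loss-scale` or `--dynamic-loss-scale`.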
@ -190,7 +190,7 @@ The following section lists the requirements that you need to meet in order to s
This repository contains Dockerfile which extends the PyTorch NGC container and encapsulates some dependencies. Aside from these dependencies, ensure you have the following components:
* [NVIDIA Docker](https://github.com/NVIDIA/nvidia-docker)
* [PyTorch 20.06-py3 NGC container](https://ngc.nvidia.com/registry/nvidia-pytorch) or newer
* [PyTorch 20.12-py3 NGC container](https://ngc.nvidia.com/registry/nvidia-pytorch) or newer
* Supported GPUs:
* [NVIDIA Volta architecture](https://www.nvidia.com/en-us/data-center/volta-gpu-architecture/)
* [NVIDIA Turing architecture](https://www.nvidia.com/en-us/geforce/turing/)
@ -242,27 +242,27 @@ For the specifics concerning training and inference, see the [Advanced](#advance
The directory in which the `train/` and `val/` directories are placed, is referred to as `<path to imagenet>` in this document.
### 3. Build the RNXT101-32x4d PyTorch NGC container.
### 3. Build the ResNeXt101-32x4d PyTorch NGC container.
```
docker build . -t nvidia_rnxt101-32x4d
docker build . -t nvidia_resnext101-32x4d
```
### 4. Start an interactive session in the NGC container to run training/inference.
```
nvidia-docker run --rm -it -v <path to imagenet>:/imagenet --ipc=host nvidia_rnxt101-32x4d
nvidia-docker run --rm -it -v <path to imagenet>:/imagenet --ipc=host nvidia_resnext101-32x4d
```
### 5. Start training
To run training for a standard configuration (DGXA100/DGX1/DGX2, AMP/TF32/FP32, 90/250 Epochs),
To run training for a standard configuration (DGXA100/DGX1V, AMP/TF32/FP32, 90/250 Epochs),
run one of the scripts in the `./resnext101-32x4d/training` directory
called `./resnext101-32x4d/training/{AMP, TF32, FP32}/{DGXA100, DGX1, DGX2}_RNXT101-32x4d_{AMP, TF32, FP32}_{90,250}E.sh`.
called `./resnext101-32x4d/training/{AMP, TF32, FP32}/{DGXA100, DGX1V}_resnext101-32x4d_{AMP, TF32, FP32}_{90, 250}E.sh`.
Ensure ImageNet is mounted in the `/imagenet` directory.
Example:
`bash ./resnext101-32x4d/training/AMP/DGX1_RNXT101-32x4d_AMP_250E.sh <path were to store checkpoints and logs>`
`bash ./resnext101-32x4d/training/AMP/DGX1_resnext101-32x4d_AMP_250E.sh <path where to store checkpoints and logs>`
### 6. Start inference
@ -280,7 +280,7 @@ To run inference on ImageNet, run:
To run inference on JPEG image using pretrained weights:
`python classify.py --arch resnext101-32x4d -c fanin --weights nvidia_resnext101-32x4d_200821.pth.tar --precision AMP|FP32 --image <path to JPEG image>`
`python classify.py --arch resnext101-32x4d -c fanin --weights nvidia_resnext101-32x4d_200821.pth.tar --precision AMP|FP32 --image <path to JPEG image>`
## Advanced
@ -319,7 +319,7 @@ usage: main.py [-h] [--data-backend BACKEND] [--arch ARCH]
[--lr-schedule SCHEDULE] [--warmup E] [--label-smoothing S]
[--mixup ALPHA] [--momentum M] [--weight-decay W]
[--bn-weight-decay] [--nesterov] [--print-freq N]
[--resume PATH] [--pretrained-weights PATH] [--fp16]
[--resume PATH] [--pretrained-weights PATH]
[--static-loss-scale STATIC_LOSS_SCALE] [--dynamic-loss-scale]
[--prof N] [--amp] [--seed SEED] [--gather-checkpoints]
[--raport-file RAPORT_FILE] [--evaluate] [--training-only]
@ -338,8 +338,10 @@ optional arguments:
data backend: pytorch | syntetic | dali-gpu | dali-cpu
(default: dali-cpu)
--arch ARCH, -a ARCH model architecture: resnet18 | resnet34 | resnet50 |
resnet101 | resnet152 | resnext101-32x4d | se-
resnext101-32x4d (default: resnet50)
resnet101 | resnet152 | resnext50-32x4d |
resnext101-32x4d | resnext101-32x8d |
resnext101-32x8d-basic | se-resnext101-32x4d (default:
resnet50)
--model-config CONF, -c CONF
model configs: classic | fanin | grp-fanin | grp-
fanout(default: classic)
@ -368,10 +370,9 @@ optional arguments:
--resume PATH path to latest checkpoint (default: none)
--pretrained-weights PATH
load weights from here
--fp16 Run model fp16 mode.
--static-loss-scale STATIC_LOSS_SCALE
Static loss scale, positive power of 2 values can
improve fp16 convergence.
improve amp convergence.
--dynamic-loss-scale Use dynamic loss scaling. If supplied, this argument
supersedes --static-loss-scale.
--prof N Run only N iterations
@ -399,25 +400,7 @@ To use your own dataset, divide it in directories as in the following scheme:
- Training images - `train/<class id>/<image>`
- Validation images - `val/<class id>/<image>`
If your dataset's has number of classes different than 1000, you need to add a custom config
in the `image_classification/resnet.py` file.
```python
resnet_versions = {
...
'resnext101-32x4d-custom' : {
'net' : ResNet,
'block' : Bottleneck,
'cardinality' : 32,
'layers' : [3, 4, 23, 3],
'widths' : [128, 256, 512, 1024],
'expansion' : 2,
'num_classes' : <custom number of classes>,
}
}
```
After adding the config, run the training script with `--arch resnext101-32x4d-custom` flag.
If your dataset has a number of classes different from 1000, you need to pass the `--num-classes N` flag to the training script.
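For example, for a hypothetical dataset with 200 classes laid out as above, the flag can be appended to the training command, e.g. `python ./main.py <path to your dataset> --arch resnext101-32x4d --num-classes 200 ...` (remaining hyperparameters omitted).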
### Training process
@ -440,7 +423,7 @@ To restart training from checkpoint use `--resume` option.
To start training from pretrained weights (e.g. downloaded from NGC), use the `--pretrained-weights` option.
The difference between those two is that the pretrained weights contain only model weights,
and checkpoints, apart from model weights, contain optimizer state, LR scheduler state, RNG state.
and checkpoints, apart from model weights, also contain the optimizer state and LR scheduler state.
Checkpoints are suitable for dividing the training into parts, for example in order
to divide the training job into shorter stages, or to restart training after an infrastructure failure.
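For example (paths are placeholders), `python ./main.py /imagenet --arch resnext101-32x4d --resume <path to checkpoint> ...` continues an interrupted run from a saved checkpoint, while `python ./main.py /imagenet --arch resnext101-32x4d --pretrained-weights nvidia_resnext101-32x4d_200821.pth.tar ...` starts a fresh run initialized from the downloaded weights.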
@ -482,9 +465,9 @@ You can also run ImageNet validation on pretrained weights:
Pretrained weights can be downloaded from NGC:
```bash
wget --content-disposition https://api.ngc.nvidia.com/v2/models/nvidia/resnext101-32x4d_pyt_amp/versions/20.06.0/zip -O resnext101-32x4d_pyt_amp_20.06.0.zip
wget --content-disposition https://api.ngc.nvidia.com/v2/models/nvidia/resnext101_32x4d_pyt_amp/versions/20.06.0/zip -O resnext101_32x4d_pyt_amp_20.06.0.zip
unzip resnext101-32x4d_pyt_amp_20.06.0.zip
unzip resnext101_32x4d_pyt_amp_20.06.0.zip
```
To run inference on ImageNet, run:
@ -493,7 +476,7 @@ To run inference on ImageNet, run:
To run inference on JPEG image using pretrained weights:
`python classify.py --arch resnext101-32x4d -c fanin --weights nvidia_resnext101-32x4d_200821.pth.tar --precision AMP|FP32 --image <path to JPEG image>`
`python classify.py --arch resnext101-32x4d --weights nvidia_resnext101-32x4d_200821.pth.tar --precision AMP|FP32 --image <path to JPEG image>`
## Performance
@ -507,71 +490,62 @@ The following section shows how to run benchmarks measuring the model performanc
To benchmark training, run:
* For 1 GPU
* FP32
`python ./main.py --arch resnext101-32x4d -b <batch_size> --training-only -p 1 --raport-file benchmark.json --epochs 1 --prof 100 <path to imagenet>`
* FP32 (V100 GPUs only)
`python ./launch.py --model resnext101-32x4d --precision FP32 --mode benchmark_training --platform DGX1V <path to imagenet> --raport-file benchmark.json --epochs 1 --prof 100`
* TF32 (A100 GPUs only)
`python ./launch.py --model resnext101-32x4d --precision TF32 --mode benchmark_training --platform DGXA100 <path to imagenet> --raport-file benchmark.json --epochs 1 --prof 100`
* AMP
`python ./main.py --arch resnext101-32x4d -b <batch_size> --training-only -p 1 --raport-file benchmark.json --epochs 1 --prof 100 --amp --static-loss-scale 256 --memory-format nhwc <path to imagenet>`
`python ./launch.py --model resnext101-32x4d --precision AMP --mode benchmark_training --platform <DGX1V|DGXA100> <path to imagenet> --raport-file benchmark.json --epochs 1 --prof 100`
* For multiple GPUs
* FP32
`python ./multiproc.py --nproc_per_node 8 ./main.py --arch resnext101-32x4d -b <batch_size> --training-only -p 1 --raport-file benchmark.json --epochs 1 --prof 100 <path to imagenet>`
* FP32 (V100 GPUs only)
`python ./launch.py --model resnext101-32x4d --precision FP32 --mode benchmark_training --platform DGX1V <path to imagenet> --raport-file benchmark.json --epochs 1 --prof 100`
* TF32 (A100 GPUs only)
`python ./multiproc.py --nproc_per_node 8 ./launch.py --model resnext101-32x4d --precision TF32 --mode benchmark_training --platform DGXA100 <path to imagenet> --raport-file benchmark.json --epochs 1 --prof 100`
* AMP
`python ./multiproc.py --nproc_per_node 8 ./main.py --arch resnext101-32x4d -b <batch_size> --training-only -p 1 --raport-file benchmark.json --amp --static-loss-scale 256 --epochs 1 --prof 100 --memory-format nhwc <path to imagenet>`
`python ./multiproc.py --nproc_per_node 8 ./launch.py --model resnext101-32x4d --precision AMP --mode benchmark_training --platform <DGX1V|DGXA100> <path to imagenet> --raport-file benchmark.json --epochs 1 --prof 100`
Each of these scripts will run 100 iterations and save results in the `benchmark.json` file.
Batch size should be picked appropriately depending on the hardware configuration.
| *Platform* | *Precision* | *Batch Size* |
|:----------:|:-----------:|:------------:|
| DGXA100 | AMP | 128 |
| DGXA100 | TF32 | 128 |
| DGX-1 | AMP | 128 |
| DGX-1 | FP32 | 64 |
#### Inference performance benchmark
To benchmark inference, run:
* FP32
* FP32 (V100 GPUs only)
`python ./main.py --arch resnext101-32x4d -b <batch_size> -p 1 --raport-file benchmark.json --epochs 1 --prof 100 --evaluate <path to imagenet>`
`python ./launch.py --model resnext101-32x4d --precision FP32 --mode benchmark_inference --platform DGX1V <path to imagenet> --raport-file benchmark.json --epochs 1 --prof 100`
* TF32 (A100 GPUs only)
`python ./launch.py --model resnext101-32x4d --precision FP32 --mode benchmark_inference --platform DGXA100 <path to imagenet> --raport-file benchmark.json --epochs 1 --prof 100`
* AMP
`python ./main.py --arch resnext101-32x4d -b <batch_size> -p 1 --raport-file benchmark.json --epochs 1 --prof 100 --evaluate --amp --memory-format nhwc <path to imagenet>`
`python ./launch.py --model resnext101-32x4d --precision AMP --mode benchmark_inference --platform <DGX1V|DGXA100> <path to imagenet> --raport-file benchmark.json --epochs 1 --prof 100`
Each of these scripts will run 100 iterations and save results in the `benchmark.json` file.
Batch size should be picked appropriately depending on the hardware configuration.
| *Platform* | *Precision* | *Batch Size* |
|:----------:|:-----------:|:------------:|
| DGXA100 | AMP | 128 |
| DGXA100 | TF32 | 128 |
| DGX-1 | AMP | 128 |
| DGX-1 | FP32 | 64 |
### Results
Our results were obtained by running the applicable training script in the pytorch-20.06 NGC container.
Our results were obtained by running the applicable training script in the pytorch-20.12 NGC container.
To achieve these same results, follow the steps in the [Quick Start Guide](#quick-start-guide).
#### Training accuracy results
##### Training accuracy: NVIDIA DGX A100 (8x A100 40GB)
##### Training accuracy: NVIDIA DGX A100 (8x A100 80GB)
| **Epochs** | **Mixed Precision Top1** | **TF32 Top1** |
|:----------:|:------------------------:|:--------------:|
| 90 | 79.47 +/- 0.03 | 79.38 +/- 0.07 |
| 250 | 80.19 +/- 0.08 | 80.27 +/- 0.1 |
| **epochs** | **Mixed Precision Top1** | **TF32 Top1** |
|:------:|:--------------------:|:--------------:|
| 90 | 79.37 +/- 0.13 | 79.38 +/- 0.13 |
##### Training accuracy: NVIDIA DGX-1 (8x V100 16GB)
| **epochs** | **Mixed Precision Top1** | **FP32 Top1** |
|:-:|:-:|:-:|
| 90 | 79.43 +/- 0.04 | 79.40 +/- 0.10 |
| 250 | 79.92 +/- 0.13 | 80.06 +/- 0.06 |
| **Epochs** | **Mixed Precision Top1** | **FP32 Top1** |
|:----------:|:------------------------:|:--------------:|
| 90 | 79.49 +/- 0.05 | 79.40 +/- 0.10 |
| 250 | 80.26 +/- 0.11 | 80.06 +/- 0.06 |
##### Example plots
@ -586,26 +560,29 @@ The following images show a 250 epochs configuration on a DGX-1V.
#### Training performance results
##### Training performance: NVIDIA DGX A100 (8x A100 40GB)
##### Training performance: NVIDIA DGX A100 (8x A100 80GB)
| **GPUs** | **Mixed Precision** | **TF32** | **Mixed Precision Speedup** | **Mixed Precision Strong Scaling** | **Mixed Precision Training Time (90E)** | **TF32 Strong Scaling** | **TF32 Training Time (90E)** |
|:--------:|:-------------------:|:----------:|:---------------------------:|:----------------------------------:|:---------------------------------------:|:-----------------------:|:----------------------------:|
| 1 | 1169 img/s | 420 img/s | 2.77 x | 1.0 x | ~29 hours | 1.0 x | ~80 hours |
| 8 | 7399 img/s | 3193 img/s | 2.31 x | 6.32 x | ~5 hours | 7.58 x | ~11 hours |
|**GPUs**|**Mixed Precision**| **TF32** |**Mixed Precision Speedup**|**Mixed Precision Strong Scaling**|**Mixed Precision Training Time (90E)**|**TF32 Strong Scaling**|**TF32 Training Time (90E)**|
|:------:|:-----------------:|:-----------:|:-------------------------:|:--------------------------------:|:-------------------------------------:|:---------------------:|:--------------------------:|
| 1 | 908.40 img/s |300.42 img/s | 3.02x | 1.00x | ~37 hours | 1.00x | ~111 hours |
| 8 | 6887.59 img/s |2380.51 img/s| 2.89x | 7.58x | ~5 hours | 7.92x | ~14 hours |
##### Training performance: NVIDIA DGX-1 16GB (8x V100 16GB)
|**GPUs**|**Mixed Precision**| **FP32** |**Mixed Precision Speedup**|**Mixed Precision Strong Scaling**|**Mixed Precision Training Time (90E)**|**FP32 Strong Scaling**|**FP32 Training Time (90E)**|
|:------:|:-----------------:|:-----------:|:-------------------------:|:--------------------------------:|:-------------------------------------:|:---------------------:|:--------------------------:|
| 1 | 534.91 img/s |150.05 img/s | 3.56x | 1.00x | ~62 hours | 1.00x | ~222 hours |
| 8 | 4000.79 img/s |1151.01 img/s| 3.48x | 7.48x | ~9 hours | 7.67x | ~29 hours |
| **GPUs** | **Mixed Precision** | **FP32** | **Mixed Precision Speedup** | **Mixed Precision Strong Scaling** | **Mixed Precision Training Time (90E)** | **FP32 Strong Scaling** | **FP32 Training Time (90E)** |
|:--------:|:-------------------:|:----------:|:---------------------------:|:----------------------------------:|:---------------------------------------:|:-----------------------:|:----------------------------:|
| 1 | 578 img/s | 149 img/s | 3.86 x | 1.0 x | ~59 hours | 1.0 x | ~225 hours |
| 8 | 3742 img/s | 1117 img/s | 3.34 x | 6.46 x | ~9 hours | 7.45 x | ~31 hours |
##### Training performance: NVIDIA DGX-1 32GB (8x V100 32GB)
|**GPUs**|**Mixed Precision**| **FP32** |**Mixed Precision Speedup**|**Mixed Precision Strong Scaling**|**Mixed Precision Training Time (90E)**|**FP32 Strong Scaling**|**FP32 Training Time (90E)**|
|:------:|:-----------------:|:-----------:|:-------------------------:|:--------------------------------:|:-------------------------------------:|:---------------------:|:--------------------------:|
| 1 | 516.07 img/s |139.80 img/s | 3.69x | 1.00x | ~65 hours | 1.00x | ~238 hours |
| 8 | 3861.95 img/s |1070.94 img/s| 3.61x | 7.48x | ~9 hours | 7.66x | ~31 hours |
| **GPUs** | **Mixed Precision** | **FP32** | **Mixed Precision Speedup** | **Mixed Precision Strong Scaling** | **Mixed Precision Training Time (90E)** | **FP32 Strong Scaling** | **FP32 Training Time (90E)** |
|:--------:|:-------------------:|:----------:|:---------------------------:|:----------------------------------:|:---------------------------------------:|:-----------------------:|:----------------------------:|
| 1 | 556 img/s | 151 img/s | 3.68 x | 1.0 x | ~61 hours | 1.0 x | ~223 hours |
| 8 | 3595 img/s | 1102 img/s | 3.26 x | 6.45 x | ~10 hours | 7.28 x | ~31 hours |
#### Inference performance results
@ -613,62 +590,64 @@ The following images show a 250 epochs configuration on a DGX-1V.
###### FP32 Inference Latency
| **batch size** | **Throughput Avg** | **Latency Avg** | **Latency 90%** | **Latency 95%** | **Latency 99%** |
|:-:|:-:|:-:|:-:|:-:|:-:|
| 1 | 47.34 img/s | 21.02ms | 23.41ms | 24.55ms | 26.00ms |
| 2 | 89.68 img/s | 22.14ms | 22.90ms | 24.86ms | 26.59ms |
| 4 | 175.92 img/s | 22.57ms | 24.96ms | 25.53ms | 26.03ms |
| 8 | 325.69 img/s | 24.35ms | 25.17ms | 25.80ms | 28.52ms |
| 16 | 397.04 img/s | 40.04ms | 40.01ms | 40.08ms | 40.32ms |
| 32 | 431.77 img/s | 73.71ms | 74.05ms | 74.09ms | 74.26ms |
| 64 | 485.70 img/s | 131.04ms | 131.38ms | 131.53ms | 131.81ms |
| 128 | N/A | N/A | N/A | N/A | N/A |
| **Batch Size** | **Throughput Avg** | **Latency Avg** | **Latency 95%** | **Latency 99%** |
|:--------------:|:------------------:|:---------------:|:---------------:|:---------------:|
| 1 | 55 img/s | 18.48 ms | 18.88 ms | 20.74 ms |
| 2 | 116 img/s | 17.54 ms | 18.15 ms | 21.32 ms |
| 4 | 214 img/s | 19.07 ms | 20.44 ms | 22.69 ms |
| 8 | 291 img/s | 27.8 ms | 27.99 ms | 28.47 ms |
| 16 | 354 img/s | 45.78 ms | 45.4 ms | 45.73 ms |
| 32 | 423 img/s | 77.13 ms | 75.96 ms | 76.21 ms |
| 64 | 486 img/s | 134.92 ms | 132.17 ms | 132.51 ms |
| 128 | 523 img/s | 252.11 ms | 244.5 ms | 244.99 ms |
| 256 | 530 img/s | 499.64 ms | 479.83 ms | 481.41 ms |
###### Mixed Precision Inference Latency
| **batch size** | **Throughput Avg** | **Latency Avg** | **Latency 90%** | **Latency 95%** | **Latency 99%** |
|:-:|:-:|:-:|:-:|:-:|:-:|
| 1 | 43.11 img/s | 23.05ms | 25.19ms | 25.41ms | 26.63ms |
| 2 | 83.29 img/s | 23.82ms | 25.11ms | 26.25ms | 27.29ms |
| 4 | 173.67 img/s | 22.82ms | 24.38ms | 25.26ms | 25.92ms |
| 8 | 330.18 img/s | 24.05ms | 26.45ms | 27.37ms | 27.74ms |
| 16 | 634.82 img/s | 25.00ms | 26.93ms | 28.12ms | 28.73ms |
| 32 | 884.91 img/s | 35.71ms | 35.96ms | 36.01ms | 36.13ms |
| 64 | 998.40 img/s | 63.43ms | 63.63ms | 63.75ms | 63.96ms |
| 128 | 1079.10 img/s | 117.74ms | 118.02ms | 118.11ms | 118.35ms |
| **Batch Size** | **Throughput Avg** | **Latency Avg** | **Latency 95%** | **Latency 99%** |
|:--------------:|:------------------:|:---------------:|:---------------:|:---------------:|
| 1 | 40 img/s | 25.17 ms | 28.4 ms | 30.66 ms |
| 2 | 89 img/s | 22.64 ms | 24.29 ms | 25.99 ms |
| 4 | 165 img/s | 24.54 ms | 26.23 ms | 28.61 ms |
| 8 | 334 img/s | 24.31 ms | 28.46 ms | 29.91 ms |
| 16 | 632 img/s | 25.8 ms | 27.76 ms | 29.53 ms |
| 32 | 1219 img/s | 27.35 ms | 29.86 ms | 31.6 ms |
| 64 | 1525 img/s | 43.97 ms | 42.01 ms | 42.96 ms |
| 128 | 1647 img/s | 82.22 ms | 77.65 ms | 79.56 ms |
| 256 | 1689 img/s | 161.53 ms | 151.25 ms | 152.01 ms |
##### Inference performance: NVIDIA T4
###### FP32 Inference Latency
| **batch size** | **Throughput Avg** | **Latency Avg** | **Latency 90%** | **Latency 95%** | **Latency 99%** |
|:-:|:-:|:-:|:-:|:-:|:-:|
| 1 | 55.64 img/s | 17.88ms | 19.21ms | 20.35ms | 22.29ms |
| 2 | 109.22 img/s | 18.24ms | 19.00ms | 20.43ms | 22.51ms |
| 4 | 217.27 img/s | 18.26ms | 18.88ms | 19.51ms | 21.74ms |
| 8 | 294.55 img/s | 26.74ms | 27.35ms | 27.62ms | 28.93ms |
| 16 | 351.30 img/s | 45.34ms | 45.72ms | 46.10ms | 47.43ms |
| 32 | 401.97 img/s | 79.10ms | 79.37ms | 79.44ms | 81.83ms |
| 64 | 449.30 img/s | 140.30ms | 140.73ms | 141.26ms | 143.57ms |
| 128 | N/A | N/A | N/A | N/A | N/A |
| **Batch Size** | **Throughput Avg** | **Latency Avg** | **Latency 95%** | **Latency 99%** |
|:--------------:|:------------------:|:---------------:|:---------------:|:---------------:|
| 1 | 79 img/s | 13.07 ms | 14.66 ms | 15.59 ms |
| 2 | 119 img/s | 17.21 ms | 18.07 ms | 19.78 ms |
| 4 | 141 img/s | 28.65 ms | 28.62 ms | 28.77 ms |
| 8 | 139 img/s | 57.84 ms | 58.29 ms | 58.62 ms |
| 16 | 153 img/s | 104.8 ms | 105.65 ms | 106.2 ms |
| 32 | 178 img/s | 181.24 ms | 180.96 ms | 181.57 ms |
| 64 | 179 img/s | 360.93 ms | 358.22 ms | 359.11 ms |
| 128 | 177 img/s | 735.99 ms | 726.15 ms | 727.81 ms |
| 256 | 167 img/s | 1561.91 ms | 1523.52 ms | 1525.96 ms |
###### Mixed Precision Inference Latency
| **batch size** | **Throughput Avg** | **Latency Avg** | **Latency 90%** | **Latency 95%** | **Latency 99%** |
|:-:|:-:|:-:|:-:|:-:|:-:|
| 1 | 51.14 img/s | 19.48ms | 20.16ms | 21.40ms | 26.21ms |
| 2 | 102.29 img/s | 19.44ms | 19.77ms | 20.42ms | 24.51ms |
| 4 | 209.44 img/s | 18.93ms | 19.52ms | 20.23ms | 21.95ms |
| 8 | 408.69 img/s | 19.47ms | 21.12ms | 23.15ms | 25.77ms |
| 16 | 641.78 img/s | 24.54ms | 25.19ms | 25.64ms | 27.31ms |
| 32 | 800.26 img/s | 39.28ms | 39.43ms | 39.54ms | 41.96ms |
| 64 | 883.66 img/s | 71.76ms | 71.87ms | 71.94ms | 72.78ms |
| 128 | 948.27 img/s | 134.19ms | 134.40ms | 134.58ms | 134.81ms |
| **Batch Size** | **Throughput Avg** | **Latency Avg** | **Latency 95%** | **Latency 99%** |
|:--------------:|:------------------:|:---------------:|:---------------:|:---------------:|
| 1 | 65 img/s | 15.69 ms | 16.95 ms | 17.97 ms |
| 2 | 126 img/s | 16.2 ms | 16.78 ms | 18.6 ms |
| 4 | 245 img/s | 16.77 ms | 18.35 ms | 25.88 ms |
| 8 | 488 img/s | 16.82 ms | 17.86 ms | 25.45 ms |
| 16 | 541 img/s | 30.16 ms | 29.95 ms | 30.18 ms |
| 32 | 566 img/s | 57.79 ms | 57.11 ms | 57.29 ms |
| 64 | 580 img/s | 112.84 ms | 111.07 ms | 111.56 ms |
| 128 | 586 img/s | 224.75 ms | 219.12 ms | 219.64 ms |
| 256 | 589 img/s | 447.25 ms | 434.18 ms | 439.22 ms |
## Release notes
@ -680,9 +659,10 @@ The following images show a 250 epochs configuration on a DGX-1V.
2. July 2020
* Added A100 scripts
* Updated README
3. February 2021
* Moved from APEX AMP to Native AMP
### Known issues
There are no known issues with this model.

View file

@ -0,0 +1 @@
python ./launch.py --model resnext101-32x4d --precision AMP --mode convergence --platform DGX1V /imagenet --workspace ${1:-./} --raport-file raport.json

View file

@ -0,0 +1 @@
python ./launch.py --model resnext101-32x4d --precision AMP --mode convergence --platform DGX1V /imagenet --epochs 90 --mixup 0.0 --workspace ${1:-./} --raport-file raport.json

View file

@ -1 +0,0 @@
python ./multiproc.py --nproc_per_node 8 ./main.py /imagenet --raport-file raport.json -j8 -p 100 --data-backend dali-cpu --arch resnext101-32x4d -c fanin --label-smoothing 0.1 --workspace $1 -b 128 --amp --static-loss-scale 128 --optimizer-batch-size 1024 --lr 1.024 --mom 0.875 --lr-schedule cosine --epochs 250 --warmup 8 --wd 6.103515625e-05 --mixup 0.2 --memory-format nhwc

View file

@ -1 +0,0 @@
python ./multiproc.py --nproc_per_node 8 ./main.py /imagenet --raport-file raport.json -j8 -p 100 --data-backend dali-cpu --arch resnext101-32x4d -c fanin --label-smoothing 0.1 --workspace $1 -b 128 --amp --static-loss-scale 128 --optimizer-batch-size 1024 --lr 1.024 --mom 0.875 --lr-schedule cosine --epochs 90 --warmup 8 --wd 6.103515625e-05 --memory-format nhwc

View file

@ -1 +0,0 @@
python ./multiproc.py --nproc_per_node 8 ./main.py /imagenet --raport-file raport.json -j16 -p 100 --data-backend dali-cpu --arch resnext101-32x4d -c fanin --label-smoothing 0.1 --workspace $1 -b 128 --amp --static-loss-scale 128 --optimizer-batch-size 1024 --lr 1.024 --mom 0.875 --lr-schedule cosine --epochs 90 --warmup 8 --wd 6.103515625e-05 --memory-format nhwc

View file

@ -0,0 +1 @@
python ./launch.py --model resnext101-32x4d --precision AMP --mode convergence --platform DGXA100 /imagenet --workspace ${1:-./} --raport-file raport.json

View file

@ -0,0 +1 @@
python ./launch.py --model resnext101-32x4d --precision AMP --mode convergence --platform DGXA100 /imagenet --epochs 90 --mixup 0.0 --workspace ${1:-./} --raport-file raport.json

View file

@ -0,0 +1 @@
python ./launch.py --model resnext101-32x4d --precision FP32 --mode convergence --platform DGX1V /imagenet --workspace ${1:-./} --raport-file raport.json

View file

@ -0,0 +1 @@
python ./launch.py --model resnext101-32x4d --precision FP32 --mode convergence --platform DGX1V /imagenet --epochs 90 --mixup 0.0 --workspace ${1:-./} --raport-file raport.json

View file

@ -1 +0,0 @@
python ./multiproc.py --nproc_per_node 8 ./main.py /imagenet --raport-file raport.json -j8 -p 100 --data-backend dali-cpu --arch resnext101-32x4d -c fanin --label-smoothing 0.1 --workspace $1 -b 64 --optimizer-batch-size 1024 --lr 1.024 --mom 0.875 --lr-schedule cosine --epochs 250 --warmup 8 --wd 6.103515625e-05 --mixup 0.2

View file

@ -1 +0,0 @@
python ./multiproc.py --nproc_per_node 8 ./main.py /imagenet --raport-file raport.json -j8 -p 100 --data-backend dali-cpu --arch resnext101-32x4d -c fanin --label-smoothing 0.1 --workspace $1 -b 64 --optimizer-batch-size 1024 --lr 1.024 --mom 0.875 --lr-schedule cosine --epochs 90 --warmup 8 --wd 6.103515625e-05

View file

@ -1 +0,0 @@
python ./multiproc.py --nproc_per_node 8 ./main.py /imagenet --raport-file raport.json -j16 -p 100 --data-backend dali-cpu --arch resnext101-32x4d -c fanin --label-smoothing 0.1 --workspace $1 -b 128 --optimizer-batch-size 1024 --lr 1.024 --mom 0.875 --lr-schedule cosine --epochs 90 --warmup 8 --wd 6.103515625e-05

View file

@ -0,0 +1 @@
python ./launch.py --model resnext101-32x4d --precision TF32 --mode convergence --platform DGXA100 /imagenet --workspace ${1:-./} --raport-file raport.json

View file

@ -0,0 +1 @@
python ./launch.py --model resnext101-32x4d --precision TF32 --mode convergence --platform DGXA100 /imagenet --epochs 90 --mixup 0.0 --workspace ${1:-./} --raport-file raport.json

View file

@ -31,11 +31,11 @@ achieve state-of-the-art accuracy, and is tested and maintained by NVIDIA.
* [Inference performance benchmark](#inference-performance-benchmark)
* [Results](#results)
* [Training accuracy results](#training-accuracy-results)
* [Training accuracy: NVIDIA DGX A100 (8x A100 40GB)](#training-accuracy-nvidia-dgx-a100-8x-a100-40gb)
* [Training accuracy: NVIDIA DGX A100 (8x A100 80GB)](#training-accuracy-nvidia-dgx-a100-8x-a100-80gb)
* [Training accuracy: NVIDIA DGX-1 (8x V100 16GB)](#training-accuracy-nvidia-dgx-1-8x-v100-16gb)
* [Example plots](#example-plots)
* [Training performance results](#training-performance-results)
* [Training performance: NVIDIA DGX A100 (8x A100 40GB)](#training-performance-nvidia-dgx-a100-8x-a100-40gb)
* [Training performance: NVIDIA DGX A100 (8x A100 80GB)](#training-performance-nvidia-dgx-a100-8x-a100-80gb)
* [Training performance: NVIDIA DGX-1 16GB (8x V100 16GB)](#training-performance-nvidia-dgx-1-16gb-8x-v100-16gb)
* [Training performance: NVIDIA DGX-1 32GB (8x V100 32GB)](#training-performance-nvidia-dgx-1-32gb-8x-v100-32gb)
* [Inference performance results](#inference-performance-results)
@ -45,7 +45,6 @@ achieve state-of-the-art accuracy, and is tested and maintained by NVIDIA.
* [Changelog](#changelog)
* [Known issues](#known-issues)
## Model overview
The SE-ResNeXt101-32x4d is a [ResNeXt101-32x4d](https://arxiv.org/pdf/1611.05431.pdf)
@ -106,13 +105,14 @@ This model uses the following data augmentation:
* Scale to 256x256
* Center crop to 224x224
### Feature support matrix
The following features are supported by this model:
| Feature | ResNeXt101-32x4d
| Feature | SE-ResNeXt101-32x4d
|-----------------------|--------------------------
|[DALI](https://docs.nvidia.com/deeplearning/dali/release-notes/index.html) | Yes
|[DALI](https://docs.nvidia.com/deeplearning/sdk/dali-release-notes/index.html) | Yes
|[APEX AMP](https://nvidia.github.io/apex/amp.html) | Yes |
#### Features
@ -129,11 +129,11 @@ which speeds up data loading when CPU becomes a bottleneck.
DALI can use CPU or GPU, and outperforms the PyTorch native dataloader.
Run training with `--data-backend dali-gpu` or `--data-backend dali-cpu` to enable DALI.
For ResNeXt101-32x4d, for DGX1 and DGX2 we recommend `--data-backends dali-cpu`.
For DGXA100 and DGX1 we recommend `--data-backend dali-cpu`.
### Mixed precision training
Mixed precision is the combined use of different numerical precisions in a computational method. [Mixed precision](https://arxiv.org/abs/1710.03740) training offers significant computational speedup by performing operations in half-precision format, while storing minimal information in single-precision to retain as much information as possible in critical parts of the network. Since the introduction of [Tensor Cores](https://developer.nvidia.com/tensor-cores) in the Volta and Turing architecture, significant training speedups are experienced by switching to mixed precision -- up to 3x overall speedup on the most arithmetically intense model architectures. Using mixed precision training requires two steps:
Mixed precision is the combined use of different numerical precisions in a computational method. [Mixed precision](https://arxiv.org/abs/1710.03740) training offers significant computational speedup by performing operations in half-precision format, while storing minimal information in single-precision to retain as much information as possible in critical parts of the network. Since the introduction of [Tensor Cores](https://developer.nvidia.com/tensor-cores) in Volta, and following with both the Turing and Ampere architectures, significant training speedups are experienced by switching to mixed precision -- up to 3x overall speedup on the most arithmetically intense model architectures. Using mixed precision training requires two steps:
1. Porting the model to use the FP16 data type where appropriate.
2. Adding loss scaling to preserve small gradient values.
@ -191,7 +191,7 @@ The following section lists the requirements that you need to meet in order to s
This repository contains Dockerfile which extends the PyTorch NGC container and encapsulates some dependencies. Aside from these dependencies, ensure you have the following components:
* [NVIDIA Docker](https://github.com/NVIDIA/nvidia-docker)
* [PyTorch 20.06-py3 NGC container](https://ngc.nvidia.com/registry/nvidia-pytorch) or newer
* [PyTorch 20.12-py3 NGC container](https://ngc.nvidia.com/registry/nvidia-pytorch) or newer
* Supported GPUs:
* [NVIDIA Volta architecture](https://www.nvidia.com/en-us/data-center/volta-gpu-architecture/)
* [NVIDIA Turing architecture](https://www.nvidia.com/en-us/geforce/turing/)
@ -216,7 +216,7 @@ cd DeepLearningExamples/PyTorch/Classification/
### 2. Download and preprocess the dataset.
The ResNeXt101-32x4d script operates on ImageNet 1k, a widely popular image classification dataset from the ILSVRC challenge.
The SE-ResNeXt101-32x4d script operates on ImageNet 1k, a widely popular image classification dataset from the ILSVRC challenge.
PyTorch can work directly on JPEGs, therefore, preprocessing/augmentation is not needed.
@ -243,27 +243,28 @@ For the specifics concerning training and inference, see the [Advanced](#advance
The directory in which the `train/` and `val/` directories are placed, is referred to as `<path to imagenet>` in this document.
### 3. Build the SE-RNXT101-32x4d PyTorch NGC container.
### 3. Build the SE-ResNeXt101-32x4d PyTorch NGC container.
```
docker build . -t nvidia_se-rnxt101-32x4d
docker build . -t nvidia_se-resnext101-32x4d
```
### 4. Start an interactive session in the NGC container to run training/inference.
```
nvidia-docker run --rm -it -v <path to imagenet>:/imagenet --ipc=host nvidia_se-rnxt101-32x4d
nvidia-docker run --rm -it -v <path to imagenet>:/imagenet --ipc=host nvidia_se-resnext101-32x4d
```
### 5. Start training
To run training for a standard configuration (DGXA100/DGX1/DGX2, AMP/TF32/FP32, 90/250 Epochs),
To run training for a standard configuration (DGXA100/DGX1V, AMP/TF32/FP32, 90/250 Epochs),
run one of the scripts in the `./se-resnext101-32x4d/training` directory
called `./se-resnext101-32x4d/training/{AMP, TF32, FP32}/{DGXA100, DGX1, DGX2}_SE-RNXT101-32x4d_{AMP, TF32, FP32}_{90,250}E.sh`.
called `./se-resnext101-32x4d/training/{AMP, TF32, FP32}/{DGXA100, DGX1V}_se-resnext101-32x4d_{AMP, TF32, FP32}_{90, 250}E.sh`.
Ensure ImageNet is mounted in the `/imagenet` directory.
Example:
`bash ./se-resnext101-32x4d/training/AMP/DGX1_SE-RNXT101-32x4d_AMP_250E.sh <path were to store checkpoints and logs>`
`bash ./se-resnext101-32x4d/training/AMP/DGX1_se-resnext101-32x4d_AMP_250E.sh <path where to store checkpoints and logs>`
### 6. Start inference
@ -281,7 +282,7 @@ To run inference on ImageNet, run:
To run inference on JPEG image using pretrained weights:
`python classify.py --arch se-resnext101-32x4d -c fanin --weights nvidia_se-resnext101-32x4d_200821.pth.tar --precision AMP|FP32 --image <path to JPEG image>`
`python classify.py --arch se-resnext101-32x4d -c fanin --weights nvidia_se-resnext101-32x4d_200821.pth.tar --precision AMP|FP32 --image <path to JPEG image>`
## Advanced
@ -320,7 +321,7 @@ usage: main.py [-h] [--data-backend BACKEND] [--arch ARCH]
[--lr-schedule SCHEDULE] [--warmup E] [--label-smoothing S]
[--mixup ALPHA] [--momentum M] [--weight-decay W]
[--bn-weight-decay] [--nesterov] [--print-freq N]
[--resume PATH] [--pretrained-weights PATH] [--fp16]
[--resume PATH] [--pretrained-weights PATH]
[--static-loss-scale STATIC_LOSS_SCALE] [--dynamic-loss-scale]
[--prof N] [--amp] [--seed SEED] [--gather-checkpoints]
[--raport-file RAPORT_FILE] [--evaluate] [--training-only]
@ -339,8 +340,10 @@ optional arguments:
data backend: pytorch | syntetic | dali-gpu | dali-cpu
(default: dali-cpu)
--arch ARCH, -a ARCH model architecture: resnet18 | resnet34 | resnet50 |
resnet101 | resnet152 | resnext101-32x4d | se-
resnext101-32x4d (default: resnet50)
resnet101 | resnet152 | resnext50-32x4d |
resnext101-32x4d | resnext101-32x8d |
resnext101-32x8d-basic | se-resnext101-32x4d (default:
resnet50)
--model-config CONF, -c CONF
model configs: classic | fanin | grp-fanin | grp-
fanout(default: classic)
@ -369,10 +372,9 @@ optional arguments:
--resume PATH path to latest checkpoint (default: none)
--pretrained-weights PATH
load weights from here
--fp16 Run model fp16 mode.
--static-loss-scale STATIC_LOSS_SCALE
Static loss scale, positive power of 2 values can
improve fp16 convergence.
improve amp convergence.
--dynamic-loss-scale Use dynamic loss scaling. If supplied, this argument
supersedes --static-loss-scale.
--prof N Run only N iterations
@ -390,6 +392,7 @@ optional arguments:
--workspace DIR path to directory where checkpoints will be stored
--memory-format {nchw,nhwc}
memory layout, nchw or nhwc
```
@ -400,25 +403,7 @@ To use your own dataset, divide it in directories as in the following scheme:
- Training images - `train/<class id>/<image>`
- Validation images - `val/<class id>/<image>`
If your dataset's has number of classes different than 1000, you need to add a custom config
in the `image_classification/resnet.py` file.
```python
resnet_versions = {
...
'se-resnext101-32x4d-custom' : {
'net' : ResNet,
'block' : SEBottleneck,
'cardinality' : 32,
'layers' : [3, 4, 23, 3],
'widths' : [128, 256, 512, 1024],
'expansion' : 2,
'num_classes' : <custom number of classes>,
}
}
```
After adding the config, run the training script with `--arch resnext101-32x4d-custom` flag.
If your dataset has a number of classes different from 1000, you need to pass the `--num-classes N` flag to the training script.
### Training process
@ -441,7 +426,7 @@ To restart training from checkpoint use `--resume` option.
To start training from pretrained weights (e.g. downloaded from NGC), use the `--pretrained-weights` option.
The difference between those two is that the pretrained weights contain only model weights,
and checkpoints, apart from model weights, contain optimizer state, LR scheduler state, RNG state.
and checkpoints, apart from model weights, also contain the optimizer state and LR scheduler state.
Checkpoints are suitable for dividing the training into parts, for example in order
to divide the training job into shorter stages, or to restart training after an infrastructure failure.
@ -487,14 +472,13 @@ wget --content-disposition https://api.ngc.nvidia.com/v2/models/nvidia/seresnext
unzip seresnext101_32x4d_pyt_amp_20.06.0.zip
```
To run inference on ImageNet, run:
`python ./main.py --arch se-resnext101-32x4d --evaluate --epochs 1 --pretrained-weights nvidia_se-resnext101-32x4d_200821.pth.tar -b <batch size> <path to imagenet>`
To run inference on JPEG image using pretrained weights:
`python classify.py --arch se-resnext101-32x4d -c fanin --weights nvidia_se-resnext101-32x4d_200821.pth.tar --precision AMP|FP32 --image <path to JPEG image>`
`python classify.py --arch se-resnext101-32x4d --weights nvidia_se-resnext101-32x4d_200821.pth.tar --precision AMP|FP32 --image <path to JPEG image>`
## Performance
@ -508,71 +492,62 @@ The following section shows how to run benchmarks measuring the model performanc
To benchmark training, run:
* For 1 GPU
* FP32
`python ./main.py --arch se-resnext101-32x4d -b <batch_size> --training-only -p 1 --raport-file benchmark.json --epochs 1 --prof 100 <path to imagenet>`
* FP32 (V100 GPUs only)
`python ./launch.py --model se-resnext101-32x4d --precision FP32 --mode benchmark_training --platform DGX1V <path to imagenet> --raport-file benchmark.json --epochs 1 --prof 100`
* TF32 (A100 GPUs only)
`python ./launch.py --model se-resnext101-32x4d --precision TF32 --mode benchmark_training --platform DGXA100 <path to imagenet> --raport-file benchmark.json --epochs 1 --prof 100`
* AMP
`python ./main.py --arch se-resnext101-32x4d -b <batch_size> --training-only -p 1 --raport-file benchmark.json --epochs 1 --prof 100 --amp --static-loss-scale 256 --memory-format nhwc <path to imagenet>`
`python ./launch.py --model se-resnext101-32x4d --precision AMP --mode benchmark_training --platform <DGX1V|DGXA100> <path to imagenet> --raport-file benchmark.json --epochs 1 --prof 100`
* For multiple GPUs
* FP32
`python ./multiproc.py --nproc_per_node 8 ./main.py --arch se-resnext101-32x4d -b <batch_size> --training-only -p 1 --raport-file benchmark.json --epochs 1 --prof 100 <path to imagenet>`
* FP32 (V100 GPUs only)
`python ./launch.py --model se-resnext101-32x4d --precision FP32 --mode benchmark_training --platform DGX1V <path to imagenet> --raport-file benchmark.json --epochs 1 --prof 100`
* TF32 (A100 GPUs only)
`python ./multiproc.py --nproc_per_node 8 ./launch.py --model se-resnext101-32x4d --precision TF32 --mode benchmark_training --platform DGXA100 <path to imagenet> --raport-file benchmark.json --epochs 1 --prof 100`
* AMP
`python ./multiproc.py --nproc_per_node 8 ./main.py --arch se-resnext101-32x4d -b <batch_size> --training-only -p 1 --raport-file benchmark.json --amp --static-loss-scale 256 --memory-format nhwc --epochs 1 --prof 100 <path to imagenet>`
`python ./multiproc.py --nproc_per_node 8 ./launch.py --model se-resnext101-32x4d --precision AMP --mode benchmark_training --platform <DGX1V|DGXA100> <path to imagenet> --raport-file benchmark.json --epochs 1 --prof 100`
Each of these scripts will run 100 iterations and save results in the `benchmark.json` file.
Batch size should be picked appropriately depending on the hardware configuration.
| *Platform* | *Precision* | *Batch Size* |
|:----------:|:-----------:|:------------:|
| DGXA100 | AMP | 128 |
| DGXA100 | TF32 | 128 |
| DGX-1 | AMP | 128 |
| DGX-1 | FP32 | 64 |
#### Inference performance benchmark
To benchmark inference, run:
* FP32
* FP32 (V100 GPUs only)
`python ./main.py --arch se-resnext101-32x4d -b <batch_size> -p 1 --raport-file benchmark.json --epochs 1 --prof 100 --evaluate <path to imagenet>`
`python ./launch.py --model se-resnext101-32x4d --precision FP32 --mode benchmark_inference --platform DGX1V <path to imagenet> --raport-file benchmark.json --epochs 1 --prof 100`
* TF32 (A100 GPUs only)
`python ./launch.py --model se-resnext101-32x4d --precision FP32 --mode benchmark_inference --platform DGXA100 <path to imagenet> --raport-file benchmark.json --epochs 1 --prof 100`
* AMP
`python ./main.py --arch se-resnext101-32x4d -b <batch_size> -p 1 --raport-file benchmark.json --epochs 1 --prof 100 --evaluate --amp --memory-format nhwc <path to imagenet>`
`python ./launch.py --model se-resnext101-32x4d --precision AMP --mode benchmark_inference --platform <DGX1V|DGXA100> <path to imagenet> --raport-file benchmark.json --epochs 1 --prof 100`
Each of these scripts will run 100 iterations and save results in the `benchmark.json` file.
Batch size should be picked appropriately depending on the hardware configuration.
| *Platform* | *Precision* | *Batch Size* |
|:----------:|:-----------:|:------------:|
| DGXA100 | AMP | 128 |
| DGXA100 | TF32 | 128 |
| DGX-1 | AMP | 128 |
| DGX-1 | FP32 | 64 |
### Results
Our results were obtained by running the applicable training script in the pytorch-20.06 NGC container.
Our results were obtained by running the applicable training script in the pytorch-20.12 NGC container.
To achieve these same results, follow the steps in the [Quick Start Guide](#quick-start-guide).
#### Training accuracy results
##### Training accuracy: NVIDIA DGX A100 (8x A100 40GB)
##### Training accuracy: NVIDIA DGX A100 (8x A100 80GB)
| **Epochs** | **Mixed Precision Top1** | **TF32 Top1** |
|:----------:|:------------------------:|:--------------:|
| 90 | 80.03 +/- 0.11 | 79.92 +/- 0.07 |
| 250 | 80.9 +/- 0.08 | 80.98 +/- 0.07 |
| **epochs** | **Mixed Precision Top1** | **TF32 Top1** |
|:------:|:--------------------:|:--------------:|
| 90 | 79.95 +/- 0.09 | 79.97 +/- 0.08 |
##### Training accuracy: NVIDIA DGX-1 (8x V100 16GB)
| **epochs** | **Mixed Precision Top1** | **FP32 Top1** |
|:-:|:-:|:-:|
| 90 | 80.04 +/- 0.10 | 79.93 +/- 0.10 |
| 250 | 80.96 +/- 0.04 | 80.97 +/- 0.09 |
| **Epochs** | **Mixed Precision Top1** | **FP32 Top1** |
|:----------:|:------------------------:|:--------------:|
| 90 | 80.04 +/- 0.07 | 79.93 +/- 0.10 |
| 250 | 80.92 +/- 0.09 | 80.97 +/- 0.09 |
##### Example plots
@ -587,26 +562,29 @@ The following images show a 250 epochs configuration on a DGX-1V.
#### Training performance results
##### Training performance: NVIDIA DGX A100 (8x A100 40GB)
##### Training performance: NVIDIA DGX A100 (8x A100 80GB)
| **GPUs** | **Mixed Precision** | **TF32** | **Mixed Precision Speedup** | **Mixed Precision Strong Scaling** | **Mixed Precision Training Time (90E)** | **TF32 Strong Scaling** | **TF32 Training Time (90E)** |
|:--------:|:-------------------:|:----------:|:---------------------------:|:----------------------------------:|:---------------------------------------:|:-----------------------:|:----------------------------:|
| 1 | 804 img/s | 360 img/s | 2.22 x | 1.0 x | ~42 hours | 1.0 x | ~94 hours |
| 8 | 5248 img/s | 2665 img/s | 1.96 x | 6.52 x | ~7 hours | 7.38 x | ~13 hours |
|**GPUs**|**Mixed Precision**| **TF32** |**Mixed Precision Speedup**|**Mixed Precision Strong Scaling**|**Mixed Precision Training Time (90E)**|**TF32 Strong Scaling**|**TF32 Training Time (90E)**|
|:------:|:-----------------:|:-----------:|:-------------------------:|:--------------------------------:|:-------------------------------------:|:---------------------:|:--------------------------:|
| 1 | 641.57 img/s |258.75 img/s | 2.48x | 1.00x | ~52 hours | 1.00x | ~129 hours |
| 8 | 4758.40 img/s |2038.03 img/s| 2.33x | 7.42x | ~7 hours | 7.88x | ~17 hours |
##### Training performance: NVIDIA DGX-1 16GB (8x V100 16GB)
|**GPUs**|**Mixed Precision**| **FP32** |**Mixed Precision Speedup**|**Mixed Precision Strong Scaling**|**Mixed Precision Training Time (90E)**|**FP32 Strong Scaling**|**FP32 Training Time (90E)**|
|:------:|:-----------------:|:----------:|:-------------------------:|:--------------------------------:|:-------------------------------------:|:---------------------:|:--------------------------:|
| 1 | 383.15 img/s |130.48 img/s| 2.94x | 1.00x | ~87 hours | 1.00x | ~255 hours |
| 8 | 2695.10 img/s |996.04 img/s| 2.71x | 7.03x | ~13 hours | 7.63x | ~34 hours |
| **GPUs** | **Mixed Precision** | **FP32** | **Mixed Precision Speedup** | **Mixed Precision Strong Scaling** | **Mixed Precision Training Time (90E)** | **FP32 Strong Scaling** | **FP32 Training Time (90E)** |
|:--------:|:-------------------:|:---------:|:---------------------------:|:----------------------------------:|:---------------------------------------:|:-----------------------:|:----------------------------:|
| 1 | 430 img/s | 133 img/s | 3.21 x | 1.0 x | ~79 hours | 1.0 x | ~252 hours |
| 8 | 2716 img/s | 994 img/s | 2.73 x | 6.31 x | ~13 hours | 7.42 x | ~34 hours |
##### Training performance: NVIDIA DGX-1 32GB (8x V100 32GB)
|**GPUs**|**Mixed Precision**| **FP32** |**Mixed Precision Speedup**|**Mixed Precision Strong Scaling**|**Mixed Precision Training Time (90E)**|**FP32 Strong Scaling**|**FP32 Training Time (90E)**|
|:------:|:-----------------:|:----------:|:-------------------------:|:--------------------------------:|:-------------------------------------:|:---------------------:|:--------------------------:|
| 1 | 364.65 img/s |123.46 img/s| 2.95x | 1.00x | ~92 hours | 1.00x | ~270 hours |
| 8 | 2540.49 img/s |959.94 img/s| 2.65x | 6.97x | ~13 hours | 7.78x | ~35 hours |
| **GPUs** | **Mixed Precision** | **FP32** | **Mixed Precision Speedup** | **Mixed Precision Strong Scaling** | **Mixed Precision Training Time (90E)** | **FP32 Strong Scaling** | **FP32 Training Time (90E)** |
|:--------:|:-------------------:|:----------:|:---------------------------:|:----------------------------------:|:---------------------------------------:|:-----------------------:|:----------------------------:|
| 1 | 413 img/s | 134 img/s | 3.08 x | 1.0 x | ~82 hours | 1.0 x | ~251 hours |
| 8 | 2572 img/s | 1011 img/s | 2.54 x | 6.22 x | ~14 hours | 7.54 x | ~34 hours |
#### Inference performance results
@ -614,62 +592,65 @@ The following images show a 250 epochs configuration on a DGX-1V.
###### FP32 Inference Latency
| **batch size** | **Throughput Avg** | **Latency Avg** | **Latency 90%** | **Latency 95%** | **Latency 99%** |
|:-:|:-:|:-:|:-:|:-:|:-:|
| 1 | 33.58 img/s | 29.72ms | 30.92ms | 31.77ms | 34.65ms |
| 2 | 66.47 img/s | 29.94ms | 31.30ms | 32.74ms | 34.79ms |
| 4 | 135.31 img/s | 29.36ms | 29.78ms | 32.61ms | 33.90ms |
| 8 | 261.52 img/s | 30.42ms | 32.73ms | 33.99ms | 35.61ms |
| 16 | 356.05 img/s | 44.61ms | 44.93ms | 45.17ms | 46.90ms |
| 32 | 391.83 img/s | 80.91ms | 81.28ms | 81.64ms | 82.69ms |
| 64 | 443.91 img/s | 142.70ms | 142.99ms | 143.46ms | 145.01ms |
| 128 | N/A | N/A | N/A | N/A | N/A |
| **Batch Size** | **Throughput Avg** | **Latency Avg** | **Latency 95%** | **Latency 99%** |
|:--------------:|:------------------:|:---------------:|:---------------:|:---------------:|
| 1 | 37 img/s | 26.81 ms | 27.89 ms | 31.44 ms |
| 2 | 75 img/s | 27.01 ms | 28.89 ms | 31.17 ms |
| 4 | 144 img/s | 28.09 ms | 30.14 ms | 32.47 ms |
| 8 | 259 img/s | 31.23 ms | 33.65 ms | 38.4 ms |
| 16 | 332 img/s | 48.7 ms | 48.35 ms | 48.8 ms |
| 32 | 394 img/s | 83.02 ms | 81.55 ms | 81.9 ms |
| 64 | 471 img/s | 138.88 ms | 136.24 ms | 136.54 ms |
| 128 | 505 img/s | 261.4 ms | 253.07 ms | 254.29 ms |
| 256 | 513 img/s | 516.66 ms | 496.06 ms | 497.05 ms |
###### Mixed Precision Inference Latency
| **batch size** | **Throughput Avg** | **Latency Avg** | **Latency 90%** | **Latency 95%** | **Latency 99%** |
|:-:|:-:|:-:|:-:|:-:|:-:|
| 1 | 35.08 img/s | 28.40ms | 29.75ms | 31.77ms | 35.85ms |
| 2 | 68.85 img/s | 28.92ms | 30.24ms | 31.46ms | 37.07ms |
| 4 | 131.78 img/s | 30.17ms | 31.39ms | 32.66ms | 37.17ms |
| 8 | 260.21 img/s | 30.52ms | 31.20ms | 32.92ms | 34.46ms |
| 16 | 506.62 img/s | 31.36ms | 32.48ms | 34.13ms | 36.49ms |
| 32 | 778.92 img/s | 40.69ms | 40.90ms | 41.07ms | 43.67ms |
| 64 | 880.49 img/s | 72.10ms | 72.29ms | 72.34ms | 76.46ms |
| 128 | 977.86 img/s | 130.19ms | 130.34ms | 130.41ms | 131.12ms |
| **Batch Size** | **Throughput Avg** | **Latency Avg** | **Latency 95%** | **Latency 99%** |
|:--------------:|:------------------:|:---------------:|:---------------:|:---------------:|
| 1 | 29 img/s | 34.24 ms | 36.67 ms | 39.4 ms |
| 2 | 53 img/s | 37.81 ms | 43.03 ms | 45.1 ms |
| 4 | 103 img/s | 39.1 ms | 43.05 ms | 46.16 ms |
| 8 | 226 img/s | 35.66 ms | 38.39 ms | 41.13 ms |
| 16 | 458 img/s | 35.4 ms | 37.38 ms | 39.97 ms |
| 32 | 882 img/s | 37.37 ms | 40.12 ms | 42.64 ms |
| 64 | 1356 img/s | 49.31 ms | 47.21 ms | 49.87 ms |
| 112 | 1448 img/s | 81.27 ms | 77.35 ms | 78.28 ms |
| 128 | 1486 img/s | 90.59 ms | 86.15 ms | 87.04 ms |
| 256 | 1534 img/s | 176.72 ms | 166.2 ms | 167.53 ms |
##### Inference performance: NVIDIA T4
###### FP32 Inference Latency
| **batch size** | **Throughput Avg** | **Latency Avg** | **Latency 90%** | **Latency 95%** | **Latency 99%** |
|:-:|:-:|:-:|:-:|:-:|:-:|
| 1 | 40.47 img/s | 24.72ms | 26.94ms | 29.33ms | 33.03ms |
| 2 | 84.16 img/s | 23.66ms | 24.53ms | 25.96ms | 29.42ms |
| 4 | 165.10 img/s | 24.08ms | 24.59ms | 25.75ms | 27.57ms |
| 8 | 266.04 img/s | 29.90ms | 30.51ms | 30.84ms | 33.07ms |
| 16 | 325.89 img/s | 48.57ms | 48.91ms | 49.02ms | 51.01ms |
| 32 | 365.99 img/s | 86.94ms | 87.15ms | 87.41ms | 90.74ms |
| 64 | 410.43 img/s | 155.30ms | 156.07ms | 156.36ms | 164.74ms |
| 128 | N/A | N/A | N/A | N/A | N/A |
| **Batch Size** | **Throughput Avg** | **Latency Avg** | **Latency 95%** | **Latency 99%** |
|:--------------:|:------------------:|:---------------:|:---------------:|:---------------:|
| 1 | 52 img/s | 19.39 ms | 20.39 ms | 21.18 ms |
| 2 | 102 img/s | 19.98 ms | 21.4 ms | 23.75 ms |
| 4 | 134 img/s | 30.12 ms | 30.14 ms | 30.54 ms |
| 8 | 136 img/s | 59.07 ms | 60.63 ms | 61.49 ms |
| 16 | 154 img/s | 104.38 ms | 105.21 ms | 105.81 ms |
| 32 | 169 img/s | 190.12 ms | 189.64 ms | 190.24 ms |
| 64 | 171 img/s | 376.19 ms | 374.16 ms | 375.6 ms |
| 128 | 168 img/s | 771.4 ms | 761.64 ms | 764.7 ms |
| 256 | 159 img/s | 1639.15 ms | 1603.45 ms | 1605.47 ms |
###### Mixed Precision Inference Latency
| **batch size** | **Throughput Avg** | **Latency Avg** | **Latency 90%** | **Latency 95%** | **Latency 99%** |
|:-:|:-:|:-:|:-:|:-:|:-:|
| 1 | 38.80 img/s | 25.74ms | 26.10ms | 29.28ms | 31.72ms |
| 2 | 78.79 img/s | 25.29ms | 25.83ms | 27.18ms | 33.07ms |
| 4 | 160.22 img/s | 24.81ms | 25.58ms | 26.25ms | 27.93ms |
| 8 | 298.01 img/s | 26.69ms | 27.59ms | 29.13ms | 32.69ms |
| 16 | 567.48 img/s | 28.05ms | 28.36ms | 31.28ms | 34.44ms |
| 32 | 709.56 img/s | 44.58ms | 44.69ms | 44.98ms | 47.99ms |
| 64 | 799.72 img/s | 79.32ms | 79.40ms | 79.49ms | 84.34ms |
| 128 | 856.19 img/s | 147.92ms | 149.02ms | 149.13ms | 151.90ms |
| **Batch Size** | **Throughput Avg** | **Latency Avg** | **Latency 95%** | **Latency 99%** |
|:--------------:|:------------------:|:---------------:|:---------------:|:---------------:|
| 1 | 42 img/s | 24.17 ms | 27.26 ms | 29.98 ms |
| 2 | 87 img/s | 23.24 ms | 24.66 ms | 26.77 ms |
| 4 | 170 img/s | 23.87 ms | 24.89 ms | 29.59 ms |
| 8 | 334 img/s | 24.49 ms | 27.92 ms | 35.66 ms |
| 16 | 472 img/s | 34.45 ms | 34.29 ms | 35.72 ms |
| 32 | 502 img/s | 64.93 ms | 64.47 ms | 65.16 ms |
| 64 | 517 img/s | 126.24 ms | 125.03 ms | 125.86 ms |
| 128 | 522 img/s | 250.99 ms | 245.87 ms | 247.1 ms |
| 256 | 523 img/s | 502.41 ms | 487.58 ms | 489.69 ms |
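The latency statistics in these tables are derived from per-batch timings. A minimal sketch of such a measurement (an illustration under assumptions, not the repository's inference benchmark), given a hypothetical `run_inference(batch)` callable and a list of equally sized batches:

```
import time
import numpy as np

def benchmark_inference(run_inference, batches, warmup=10):
    # Time each batch; discard the first `warmup` iterations, which include
    # framework initialization and cache/autotuning warmup.
    timings = []
    for i, batch in enumerate(batches):
        start = time.perf_counter()
        run_inference(batch)
        if i >= warmup:
            timings.append(time.perf_counter() - start)
    t = np.array(timings)
    batch_size = len(batches[0])
    return {
        "throughput_avg_img_s": batch_size / t.mean(),
        "latency_avg_ms": 1e3 * t.mean(),
        "latency_p95_ms": 1e3 * np.percentile(t, 95),
        "latency_p99_ms": 1e3 * np.percentile(t, 99),
    }
```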
## Release notes
@ -681,9 +662,10 @@ The following images show a 250 epochs configuration on a DGX-1V.
2. July 2020
* Added A100 scripts
* Updated README
3. February 2021
* Moved from APEX AMP to Native AMP
### Known issues
There are no known issues with this model.

View file

@ -0,0 +1 @@
python ./launch.py --model se-resnext101-32x4d --precision AMP --mode convergence --platform DGX1V /imagenet --workspace ${1:-./} --raport-file raport.json

View file

@ -0,0 +1 @@
python ./launch.py --model se-resnext101-32x4d --precision AMP --mode convergence --platform DGX1V /imagenet --epochs 90 --mixup 0.0 --workspace ${1:-./} --raport-file raport.json

View file

@ -1 +0,0 @@
python ./multiproc.py --nproc_per_node 8 ./main.py /imagenet --raport-file raport.json -j8 -p 100 --data-backend pytorch --arch se-resnext101-32x4d -c fanin --label-smoothing 0.1 --workspace $1 -b 128 --amp --static-loss-scale 128 --optimizer-batch-size 1024 --lr 1.024 --mom 0.875 --lr-schedule cosine --epochs 250 --warmup 8 --wd 6.103515625e-05 --mixup 0.2 --memory-format nhwc

View file

@ -1 +0,0 @@
python ./multiproc.py --nproc_per_node 8 ./main.py /imagenet --raport-file raport.json -j8 -p 100 --data-backend pytorch --arch se-resnext101-32x4d -c fanin --label-smoothing 0.1 --workspace $1 -b 128 --amp --static-loss-scale 128 --optimizer-batch-size 1024 --lr 1.024 --mom 0.875 --lr-schedule cosine --epochs 90 --warmup 8 --wd 6.103515625e-05 --memory-format nhwc

View file

@ -1 +0,0 @@
python ./multiproc.py --nproc_per_node 8 ./main.py /imagenet --raport-file raport.json -j16 -p 100 --data-backend pytorch --arch se-resnext101-32x4d -c fanin --label-smoothing 0.1 --workspace $1 -b 128 --amp --static-loss-scale 128 --optimizer-batch-size 1024 --lr 1.024 --mom 0.875 --lr-schedule cosine --epochs 90 --warmup 8 --wd 6.103515625e-05 --memory-format nhwc

View file

@ -0,0 +1 @@
python ./launch.py --model se-resnext101-32x4d --precision AMP --mode convergence --platform DGXA100 /imagenet --workspace ${1:-./} --raport-file raport.json

View file

@ -0,0 +1 @@
python ./launch.py --model se-resnext101-32x4d --precision AMP --mode convergence --platform DGXA100 /imagenet --epochs 90 --mixup 0.0 --workspace ${1:-./} --raport-file raport.json

View file

@ -0,0 +1 @@
python ./launch.py --model se-resnext101-32x4d --precision FP32 --mode convergence --platform DGX1V /imagenet --workspace ${1:-./} --raport-file raport.json

View file

@ -0,0 +1 @@
python ./launch.py --model se-resnext101-32x4d --precision FP32 --mode convergence --platform DGX1V /imagenet --epochs 90 --mixup 0.0 --workspace ${1:-./} --raport-file raport.json

View file

@ -1 +0,0 @@
python ./multiproc.py --nproc_per_node 8 ./main.py /imagenet --raport-file raport.json -j8 -p 100 --data-backend pytorch --arch se-resnext101-32x4d -c fanin --label-smoothing 0.1 --workspace $1 -b 64 --optimizer-batch-size 1024 --lr 1.024 --mom 0.875 --lr-schedule cosine --epochs 250 --warmup 8 --wd 6.103515625e-05 --mixup 0.2

View file

@ -1 +0,0 @@
python ./multiproc.py --nproc_per_node 8 ./main.py /imagenet --raport-file raport.json -j8 -p 100 --data-backend pytorch --arch se-resnext101-32x4d -c fanin --label-smoothing 0.1 --workspace $1 -b 64 --optimizer-batch-size 1024 --lr 1.024 --mom 0.875 --lr-schedule cosine --epochs 90 --warmup 8 --wd 6.103515625e-05

View file

@ -1 +0,0 @@
python ./multiproc.py --nproc_per_node 8 ./main.py /imagenet --raport-file raport.json -j16 -p 100 --data-backend pytorch --arch se-resnext101-32x4d -c fanin --label-smoothing 0.1 --workspace $1 -b 128 --optimizer-batch-size 1024 --lr 1.024 --mom 0.875 --lr-schedule cosine --epochs 90 --warmup 8 --wd 6.103515625e-05

View file

@ -0,0 +1 @@
python ./launch.py --model se-resnext101-32x4d --precision TF32 --mode convergence --platform DGXA100 /imagenet --workspace ${1:-./} --raport-file raport.json

View file

@ -0,0 +1 @@
python ./launch.py --model se-resnext101-32x4d --precision TF32 --mode convergence --platform DGXA100 /imagenet --epochs 90 --mixup 0.0 --workspace ${1:-./} --raport-file raport.json

View file

@ -61,6 +61,7 @@ def initialize_model(args):
model.load_state_dict(
{k.replace("module.", ""): v for k, v in state_dict.items()}
)
model.load_state_dict(state_dict)
return model.half() if args.fp16 else model

View file

@ -0,0 +1,23 @@
# Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
.idea
**/.ipynb_checkpoints
**/__pycache__
**/.gitkeep
.git
.gitignore
Dockerfile*
.dockerignore
README.md

View file

@ -0,0 +1,20 @@
# Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
.idea/
*.tar
.ipynb_checkpoints
/_python_build
*.pyc
__pycache__

View file

@ -0,0 +1,54 @@
# Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
ARG FROM_IMAGE_NAME=nvcr.io/nvidia/nvtabular:0.3
FROM ${FROM_IMAGE_NAME}
USER root
# Spark dependencies
ENV APACHE_SPARK_VERSION 2.3.1
ENV HADOOP_VERSION 2.7
RUN apt-get -y update && \
apt-get install --no-install-recommends -y openjdk-8-jre-headless ca-certificates-java time && \
apt-get clean && \
rm -rf /var/lib/apt/lists/*
RUN cd /tmp && \
wget -q http://archive.apache.org/dist/spark/spark-${APACHE_SPARK_VERSION}/spark-${APACHE_SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz && \
echo "DC3A97F3D99791D363E4F70A622B84D6E313BD852F6FDBC777D31EAB44CBC112CEEAA20F7BF835492FB654F48AE57E9969F93D3B0E6EC92076D1C5E1B40B4696 *spark-${APACHE_SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz" | sha512sum -c - && \
tar xzf spark-${APACHE_SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz -C /usr/local --owner root --group root --no-same-owner && \
rm spark-${APACHE_SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz
RUN cd /usr/local && ln -s spark-${APACHE_SPARK_VERSION}-bin-hadoop${HADOOP_VERSION} spark
# Spark config
ENV SPARK_HOME /usr/local/spark
ENV PYTHONPATH $SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.10.7-src.zip:/wd
ENV SPARK_OPTS --driver-java-options=-Xms1024M --driver-java-options=-Xmx4096M --driver-java-options=-Dlog4j.logLevel=info
ENV PYSPARK_PYTHON /conda/envs/rapids/bin/python
ENV PYSPARK_DRIVER_PYTHON /conda/envs/rapids/bin/python
SHELL ["/bin/bash", "-c"]
RUN source activate rapids && \
pip install --upgrade pip && \
pip install --no-cache-dir pyspark==2.3.1 && \
pip install --no-cache-dir --no-deps tensorflow-transform==0.24.1 apache-beam==2.14 tensorflow-metadata==0.14.0 pydot dill && \
pip install --no-cache-dir -e git://github.com/NVIDIA/dllogger#egg=dllogger
WORKDIR /wd
COPY . .

View file

@ -0,0 +1,28 @@
# Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
ARG FROM_IMAGE_NAME=nvcr.io/nvidia/tensorflow:20.12-tf2-py3
FROM ${FROM_IMAGE_NAME}
USER root
RUN pip install --upgrade pip && \
pip install --no-cache-dir --no-deps tensorflow-transform==0.24.1 tensorflow-metadata==0.14.0 pydot dill && \
pip install --no-cache-dir ipdb pynvml==8.0.4 && \
pip install --no-cache-dir -e git://github.com/NVIDIA/dllogger#egg=dllogger
WORKDIR /wd
COPY . .

View file

@ -0,0 +1,833 @@
# Wide & Deep Recommender Model Training For TensorFlow 2
This repository provides a script and recipe to train the Wide and Deep Recommender model to achieve state-of-the-art accuracy.
The content of the repository is tested and maintained by NVIDIA.
- [Model overview](#model-overview)
* [Model architecture](#model-architecture)
* [Applications and dataset](#applications-and-dataset)
* [Default configuration](#default-configuration)
* [Model accuracy metric](#model-accuracy-metric)
* [Feature support matrix](#feature-support-matrix)
+ [Features](#features)
* [Mixed precision training](#mixed-precision-training)
+ [Enabling mixed precision](#enabling-mixed-precision)
+ [Enabling TF32](#enabling-tf32)
* [Glossary](#glossary)
- [Setup](#setup)
* [Requirements](#requirements)
- [Quick Start Guide](#quick-start-guide)
- [Advanced](#advanced)
* [Scripts and sample code](#scripts-and-sample-code)
* [Parameters](#parameters)
* [Command-line options](#command-line-options)
* [Getting the data](#getting-the-data)
+ [Dataset guidelines](#dataset-guidelines)
+ [Dataset preprocessing](#dataset-preprocessing)
- [Spark CPU Dataset preprocessing](#spark-cpu-dataset-preprocessing)
- [NVTabular GPU preprocessing](#nvtabular-gpu-preprocessing)
* [Training process](#training-process)
* [Evaluation process](#evaluation-process)
- [Performance](#performance)
* [Benchmarking](#benchmarking)
+ [NVTabular and Spark CPU Preprocessing comparison](#nvtabular-and-spark-cpu-preprocessing-comparison)
+ [Training and evaluation performance benchmark](#training-and-evaluation-performance-benchmark)
* [Results](#results)
+ [Training accuracy results](#training-accuracy-results)
- [Training accuracy: NVIDIA DGX A100 (8x A100 80GB)](#training-accuracy-nvidia-dgx-a100-8x-a100-80gb)
- [Training accuracy: NVIDIA DGX-1 (8x V100 16GB)](#training-accuracy-nvidia-dgx-1-8x-v100-16gb)
- [Training accuracy plots](#training-accuracy-plots)
- [Training stability test](#training-stability-test)
- [Impact of mixed precision on training accuracy](#impact-of-mixed-precision-on-training-accuracy)
+ [Training performance results](#training-performance-results)
- [Training performance: NVIDIA DGX A100 (8x A100 80GB)](#training-performance-nvidia-dgx-a100-8x-a100-80gb)
- [Training performance: NVIDIA DGX-1 (8x V100 16GB)](#training-performance-nvidia-dgx-1-8x-v100-16gb)
+ [Evaluation performance results](#evaluation-performance-results)
- [Evaluation performance: NVIDIA DGX A100 (8x A100 80GB)](#evaluation-performance-nvidia-dgx-a100-8x-a100-80gb)
- [Evaluation performance: NVIDIA DGX-1 (8x V100 16GB)](#evaluation-performance-nvidia-dgx-1-8x-v100-16gb)
- [Release notes](#release-notes)
* [Changelog](#changelog)
* [Known issues](#known-issues)
## Model overview
Recommendation systems drive engagement on many of the most popular online platforms. As the volume of data available to power these systems grows exponentially, Data Scientists are increasingly turning from more traditional machine learning methods to highly expressive deep learning models to improve the quality of their recommendations.
Google's [Wide & Deep Learning for Recommender Systems](https://arxiv.org/abs/1606.07792) has emerged as a popular model for Click Through Rate (CTR) prediction tasks thanks to its power of generalization (deep part) and memorization (wide part).
The difference between this Wide & Deep Recommender Model and the model from the paper is the size of the deep part of the model. In Google's paper, the fully connected part consists of three layers of 1024, 512, and 256 neurons; our model consists of 5 layers of 1024 neurons each.
This model is trained with mixed precision using Tensor Cores on NVIDIA Volta and NVIDIA Ampere GPU architectures. Therefore, researchers can get results 4.5 times faster than training without Tensor Cores, while experiencing the benefits of mixed precision training. This model is tested against each NGC monthly container release to ensure consistent accuracy and performance over time.
### Model architecture
Wide & Deep refers to a class of networks that use the output of two parts working in parallel - wide model and deep model - to make a binary prediction of CTR. The wide model is a linear model of features together with their transforms. The deep model is a series of 5 hidden MLP layers of 1024 neurons. The model can handle both numerical continuous features as well as categorical features represented as dense embeddings. The architecture of the model is presented in Figure 1.
<p align="center">
<img width="100%" src="./img/model.svg">
<br>
Figure 1. The architecture of the Wide & Deep model.</a>
</p>
### Applications and dataset
As a reference dataset, we used a subset of [the features engineered](https://github.com/gabrielspmoreira/kaggle_outbrain_click_prediction_google_cloud_ml_engine) by the 19th place finisher in the [Kaggle Outbrain Click Prediction Challenge](https://www.kaggle.com/c/outbrain-click-prediction/). This competition challenged competitors to predict the likelihood with which a particular ad on a website's display would be clicked on. Competitors were given information about the user, display, document, and ad in order to train their models. More information can be found [here](https://www.kaggle.com/c/outbrain-click-prediction/data).
### Default configuration
The Outbrain dataset is preprocessed to produce the features that are input to the model. To give context to the acceleration numbers described below, some important properties of our features and model are listed below.
Features:
- Request Level:
* 5 scalar numeric features `dtype=float32`
* 8 categorical features (all INT32 `dtype`)
* 8 trainable embeddings of (dimension, cardinality of categorical variable): (128,300000), (16,4), (128,100000), (64 ,4000), (64,1000), (64,2500), (64,300), (64,2000)
* 8 trainable embeddings for wide part of size 1 (serving as an embedding from the categorical to scalar space for input to the wide portion of the model)
- Item Level:
* 8 scalar numeric features `dtype=float32`
* 5 categorical features (all INT32 `dtype`)
* 5 trainable embeddings of dimensions (cardinality of categorical variable): 128 (250000), 64 (2500), 64 (4000), 64 (1000), 64 (5000)
* 5 trainable embeddings for wide part of size 1 (working as trainable one-hot embeddings)
Features describe both the user (Request Level features) and Item (Item Level Features).
- Model:
* Input dimension is 26 (13 categorical and 13 numerical features)
* Total embedding dimension is 976
* 5 hidden layers each with size 1024
* Total number of model parameters is ~90M
* Output dimension is 1 (`y` is the probability of click given Request-level and Item-level features)
* Loss function: Binary Crossentropy
For more information about feature preprocessing, go to [Dataset preprocessing](#dataset-preprocessing).
### Model accuracy metric
Model accuracy is defined with the [MAP@12](https://en.wikipedia.org/wiki/Evaluation_measures_(information_retrieval)#Mean_average_precision) metric. This metric follows the way model accuracy was assessed in the original [Kaggle Outbrain Click Prediction Challenge](https://www.kaggle.com/c/outbrain-click-prediction/). In this repository, the leaked clicked ads are not taken into account, since in an industrial setup data scientists do not have access to leaked information when training the model. For more information about the data leak in the Kaggle Outbrain Click Prediction challenge, see this [blogpost](https://medium.com/unstructured/how-feature-engineering-can-help-you-do-well-in-a-kaggle-competition-part-ii-3645d92282b8) by the 19th place finisher in that competition.
The training and evaluation script also reports AUC ROC, binary accuracy, and loss (BCE) values.
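Because each `display_id` contains exactly one clicked ad, MAP@12 reduces to the mean reciprocal rank of the clicked ad, truncated at 12. A minimal sketch of the metric (not the repository's implementation):

```
import numpy as np

def map_at_12(scores_per_display, clicked_index_per_display):
    # scores_per_display: list of 1-D arrays of model scores, one array per display_id
    # clicked_index_per_display: index of the clicked ad inside each score array
    average_precisions = []
    for scores, clicked in zip(scores_per_display, clicked_index_per_display):
        top12 = np.argsort(-scores)[:12]              # ads ranked by predicted score
        hit = np.where(top12 == clicked)[0]
        average_precisions.append(1.0 / (hit[0] + 1) if hit.size else 0.0)
    return float(np.mean(average_precisions))
```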
### Feature support matrix
The following features are supported by this model:
| Feature | Wide & Deep |
| -------------------------------- | ----------- |
| Horovod Multi-GPU (NCCL) | Yes |
| Accelerated Linear Algebra (XLA) | Yes |
| Automatic mixed precision (AMP) | Yes |
#### Features
**Horovod** is a distributed training framework for TensorFlow, Keras, PyTorch and MXNet. The goal of Horovod is to make distributed deep learning fast and easy to use. For more information about how to get started with Horovod, see: [Horovod: Official repository](https://github.com/horovod/horovod).
**Multi-GPU training with Horovod**
Our model uses Horovod to implement efficient multi-GPU training with NCCL. For details, see example sources in this repository or see: [TensorFlow tutorial](https://github.com/horovod/horovod/#usage).
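A minimal Horovod + Keras sketch of this setup (an illustration under assumptions, not the repository's training loop): one process is started per GPU, the optimizer is wrapped in `DistributedOptimizer`, initial variables are broadcast from rank 0, and the per-GPU batch is the global batch divided by the number of workers (strong scaling):

```
import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()
# Pin each Horovod process to a single GPU.
gpus = tf.config.list_physical_devices("GPU")
if gpus:
    tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], "GPU")

GLOBAL_BATCH_SIZE = 131072
local_batch_size = GLOBAL_BATCH_SIZE // hvd.size()   # strong scaling

optimizer = hvd.DistributedOptimizer(tf.keras.optimizers.RMSprop(0.00012))
callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]
# model.compile(optimizer=optimizer, loss="binary_crossentropy")
# model.fit(train_dataset, callbacks=callbacks, ...)
```

Such a script is then launched with one process per GPU, for example with the `mpiexec ... -np 8 python main.py` commands shown in the [Quick Start Guide](#quick-start-guide).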
**XLA** is a domain-specific compiler for linear algebra that can accelerate TensorFlow models with potentially no source code changes. Enabling XLA results in improvements to speed and memory usage: most internal benchmarks run ~1.1-1.5x faster after XLA is enabled. For more information on XLA visit [official XLA page](https://www.tensorflow.org/xla).
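One common way to turn on XLA auto-clustering for a TensorFlow 2 program (an assumption about the mechanism behind the `--xla` flag, not a copy of the training script):

```
import tensorflow as tf

# Enable XLA JIT compilation (auto-clustering) globally; a similar effect can
# be obtained with the TF_XLA_FLAGS=--tf_xla_auto_jit=2 environment variable.
tf.config.optimizer.set_jit(True)
```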
### Mixed precision training
Mixed precision is the combined use of different numerical precisions in a computational method. [Mixed precision](https://arxiv.org/abs/1710.03740) training offers significant computational speedup by performing operations in half-precision format while storing minimal information in single-precision to retain as much information as possible in critical parts of the network. Since the introduction of [Tensor Cores](https://developer.nvidia.com/tensor-cores) in Volta, and following with both the Turing and Ampere architectures, significant training speedups are experienced by switching to mixed precision -- up to 3x overall speedup on the most arithmetically intense model architectures. Using [mixed precision training](https://docs.nvidia.com/deeplearning/performance/mixed-precision-training/index.html) previously required two steps:
1. Porting the model to use the FP16 data type where appropriate.
2. Adding loss scaling to preserve small gradient values.
The ability to train deep learning networks with lower precision was introduced in the Pascal architecture and first supported in [CUDA 8](https://devblogs.nvidia.com/parallelforall/tag/fp16/) in the NVIDIA Deep Learning SDK.
For more information:
* How to train using mixed precision, see the [Mixed Precision Training](https://arxiv.org/abs/1710.03740) paper and [Training With Mixed Precision](https://docs.nvidia.com/deeplearning/performance/mixed-precision-training/index.html) documentation.
* Techniques used for mixed precision training, see the [Mixed-Precision Training of Deep Neural Networks](https://devblogs.nvidia.com/mixed-precision-training-deep-neural-networks/) blog.
* How to access and enable AMP for TensorFlow, see [Using TF-AMP](https://docs.nvidia.com/deeplearning/dgx/tensorflow-user-guide/index.html#tfamp) from the TensorFlow User Guide.
For information on the influence of mixed precision training on model accuracy in training and inference, go to [Training accuracy results](#training-accuracy-results).
#### Enabling mixed precision
To enable Wide & Deep training to use mixed precision, add the additional flag `--amp` to the training script. Refer to the [Quick Start Guide](#quick-start-guide) for more information.
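Under the hood, mixed precision in Keras is typically enabled with a global `mixed_float16` policy plus dynamic loss scaling. The sketch below uses the experimental API names from the TensorFlow release shipped in the 20.12 container and is an illustration only, not a copy of the training script:

```
import tensorflow as tf

# Compute in float16, keep variables in float32 (newer TensorFlow releases
# expose the same switch as tf.keras.mixed_precision.set_global_policy).
tf.keras.mixed_precision.experimental.set_policy("mixed_float16")

optimizer = tf.keras.optimizers.RMSprop(0.00012)
# Dynamic loss scaling preserves small gradient values in float16.
optimizer = tf.keras.mixed_precision.experimental.LossScaleOptimizer(
    optimizer, loss_scale="dynamic"
)
```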
#### Enabling TF32
TensorFloat-32 (TF32) is the new math mode in NVIDIA A100 GPUs for handling tensor operations. TF32 running on Tensor Cores in A100 GPUs can provide up to 10x speedups compared to single-precision floating-point math (FP32) on Volta GPUs.
TF32 Tensor Cores can speed up networks using FP32, typically with no loss of accuracy. It is more robust than FP16 for models which require high dynamic range for weights or activations.
For more information, refer to the [TensorFloat-32 in the A100 GPU Accelerates AI Training, HPC up to 20x](https://blogs.nvidia.com/blog/2020/05/14/tensorfloat-32-precision-format/) blog post.
TF32 is supported in the NVIDIA Ampere GPU architecture and is enabled by default.
### Glossary
**Request level features**
Features that describe the person and context to which we wish to make recommendations.
**Item level features**
Features that describe those objects which we are considering recommending.
## Setup
The following section lists the requirements that you need to meet in order to start training the Wide & Deep model.
### Requirements
This repository contains Dockerfile which extends the TensorFlow2 NGC container and encapsulates some dependencies. Aside from these dependencies, ensure you have the following components:
- [NVIDIA Docker](https://github.com/NVIDIA/nvidia-docker)
- [20.12-tf2-py3](https://ngc.nvidia.com/catalog/containers/nvidia:tensorflow) NGC container
Supported GPUs:
- [NVIDIA Volta architecture](https://www.nvidia.com/en-us/data-center/volta-gpu-architecture/)
- [NVIDIA Turing architecture](https://www.nvidia.com/en-us/design-visualization/technologies/turing-architecture/)
- [NVIDIA Ampere architecture](https://www.nvidia.com/en-us/data-center/nvidia-ampere-gpu-architecture/)
For more information about how to get started with NGC containers, see the following sections from the NVIDIA GPU Cloud Documentation and the Deep Learning Documentation:
* [Getting Started Using NVIDIA GPU Cloud](https://docs.nvidia.com/ngc/ngc-getting-started-guide/index.html)
* [Accessing And Pulling From The NGC Container Registry](https://docs.nvidia.com/deeplearning/frameworks/user-guide/index.html#accessing_registry)
* [Running TensorFlow](https://docs.nvidia.com/deeplearning/frameworks/tensorflow-release-notes/running.html#running)
For those unable to use the TensorFlow2 NGC container, to set up the required environment or create their own container, see the versioned [NVIDIA Container Support Matrix](https://docs.nvidia.com/deeplearning/frameworks/support-matrix/index.html).
## Quick Start Guide
To train your model using the default parameters of the Wide & Deep model on the Outbrain dataset in TF32 or FP32 precision, perform the following steps. For the specifics concerning training and inference with custom settings, see the [Advanced section](#advanced).
1. Clone the repository.
```
git clone https://github.com/NVIDIA/DeepLearningExamples
```
2. Go to `WideAndDeep` TensorFlow2 directory within the `DeepLearningExamples` repository:
```
cd DeepLearningExamples/TensorFlow2/Recommendation/WideAndDeep
```
3. Download the Outbrain dataset.
The Outbrain dataset can be downloaded from Kaggle (requires Kaggle account). Unzip the downloaded archive (for example, to `/raid/outbrain/orig`) and set the `HOST_OUTBRAIN_PATH` variable to the parent directory:
```
HOST_OUTBRAIN_PATH=/raid/outbrain
```
4. Preprocess the Outbrain dataset.
4.1. Build the Wide & Deep Preprocessing Container.
```
cd DeepLearningExamples/TensorFlow2/Recommendation/WideAndDeep
docker build -f Dockerfile-preproc . -t wd2-prep
```
4.2. Start an interactive session in the Wide&Deep Preprocessing Container. Run preprocessing against the original Outbrain dataset to produce TFRecords. You can run preprocessing using either Spark (CPU) or NVTabular (GPU).
```
nvidia-docker run --rm -it --ipc=host -v ${HOST_OUTBRAIN_PATH}:/outbrain wd2-prep bash
```
4.3. Start preprocessing.
You can preprocess the data using either Spark on CPU or NVTabular on GPU. For more information, go to the [Dataset preprocessing](#dataset-preprocessing) section.
4.3.1. CPU Preprocessing (Spark).
```
cd /wd && bash scripts/preproc.sh spark 40
```
4.3.2. GPU Preprocessing (NVTabular).
```
cd /wd && bash scripts/preproc.sh nvtabular 40
```
The result of the preprocessing scripts is pre-batched TFRecords. The argument to the script is the number of TFRecord files that will be generated (here, 40). The TFRecord files are generated in `${HOST_OUTBRAIN_PATH}/tfrecords`.
4.4. Training of the model
4.4.1. Build the Wide&Deep Training Container
```
cd DeepLearningExamples/TensorFlow2/Recommendation/WideAndDeep
docker build -f Dockerfile-train . -t wd2-train
```
4.4.2. Start an interactive session in the Wide&Deep Training Container
```
nvidia-docker run --rm -it --privileged --ipc=host -v ${HOST_OUTBRAIN_PATH}:/outbrain wd2-train bash
```
4.4.3. Run training
For 1 GPU:
```
python main.py
```
For 1 GPU with Mixed Precision training with XLA:
```
python main.py --xla --amp
```
For complete usage, run:
```
python main.py -h
```
For 8 GPUs:
```
mpiexec --allow-run-as-root --bind-to socket -np 8 python main.py
```
For 8 GPU with Mixed Precision training with XLA:
```
mpiexec --allow-run-as-root --bind-to socket -np 8 python main.py --xla --amp
```
5. Run validation or evaluation.
If you want to run validation or evaluation, you can either:
* use the checkpoint obtained from the training commands above, or
* download the pretrained checkpoint from NGC.
In order to download the checkpoint from NGC, visit [ngc.nvidia.com](https://ngc.nvidia.com) website and browse the available models. Download the checkpoint files and unzip them to some path, for example, to `$HOST_OUTBRAIN_PATH/checkpoints/` (which is the default path for storing the checkpoints during training). The checkpoint requires around 700MB disk space.
6. Start validation/evaluation.
In order to validate the checkpoint on the evaluation set, run the `main.py` script with the `--evaluate` and `--use_checkpoint` flags.
For 1 GPU:
```
python main.py --evaluate --use_checkpoint
```
For 8 GPUs:
```
mpiexec --allow-run-as-root --bind-to socket -np 8 python main.py --evaluate --use_checkpoint
```
Now that you have your model trained and evaluated, you can choose to compare your training results with our [Training accuracy results](#training-accuracy-results). You can also benchmark your performance against the [Training and evaluation performance benchmark](#training-and-evaluation-performance-benchmark). Following the steps in these sections will ensure that you achieve the same accuracy and performance results as stated in the [Results](#results) section.
## Advanced
The following sections provide greater details of the dataset, running training, and the training results.
### Scripts and sample code
These are the important scripts in this repository:
* `main.py` - Python script for training the Wide & Deep recommender model. This script is run inside the training container (named `wd2-train` in the [Quick Start Guide](#quick-start-guide)).
* `scripts/preproc.sh` - Bash script that prepares the Outbrain dataset for training by preprocessing it and saving it in TFRecord format. This script is run inside the preprocessing container (named `wd2-prep` in the [Quick Start Guide](#quick-start-guide)).
* `data/outbrain/dataloader.py` - Python file containing the data loaders for the training and evaluation sets.
* `data/outbrain/features.py` - Python file describing the request- and item-level features as well as embedding dimensions and hash bucket sizes.
* `trainer/model/widedeep.py` - Python file with model definition.
* `trainer/utils/run.py` - Python file with training loop.
### Parameters
These are the important parameters in the `main.py` script:
| Scope | Parameter | Comment | Default Value |
| -------------------- | ------------------------------------------------------ | ---------------------------------------------------------------------- | --------------------- |
| location of datasets | --transformed_metadata_path TRANSFORMED_METADATA_PATH | Path to transformed_metadata for feature specification reconstruction | |
| location of datasets | --use_checkpoint | Use checkpoint stored in model_dir path | False |
| location of datasets | --model_dir MODEL_DIR | Destination where model checkpoint will be saved | /outbrain/checkpoints |
| location of datasets | --results_dir RESULTS_DIR | Directory to store training results | /results |
| location of datasets | --log_filename LOG_FILENAME | Name of the file to store dllogger output | log.json |
| training parameters | --training_set_size TRAINING_SET_SIZE | Number of samples in the training set | 59761827 |
| training parameters | --global_batch_size GLOBAL_BATCH_SIZE | Total size of training batch | 131072 |
| training parameters | --eval_batch_size EVAL_BATCH_SIZE | Total size of evaluation batch | 131072 |
| training parameters | --num_epochs NUM_EPOCHS | Number of training epochs | 20 |
| training parameters | --cpu | Run computations on the CPU | False |
| training parameters | --amp | Enable automatic mixed precision conversion | False |
| training parameters | --xla | Enable XLA conversion | False |
| training parameters | --linear_learning_rate LINEAR_LEARNING_RATE | Learning rate for linear model | 0.02 |
| training parameters | --deep_learning_rate DEEP_LEARNING_RATE | Learning rate for deep model | 0.00012 |
| training parameters | --deep_warmup_epochs DEEP_WARMUP_EPOCHS | Number of learning rate warmup epochs for deep model | 6 |
| model construction | --deep_hidden_units DEEP_HIDDEN_UNITS [DEEP_HIDDEN_UNITS ...] | Hidden units per layer for deep model, separated by spaces | [1024, 1024, 1024, 1024, 1024] |
| model construction | --deep_dropout DEEP_DROPOUT | Dropout regularization for deep model | 0.1 |
| run mode parameters | --evaluate | Only perform an evaluation on the validation dataset, don't train | False |
| run mode parameters | --benchmark | Run training or evaluation benchmark to collect performance metrics | False |
| run mode parameters | --benchmark_warmup_steps BENCHMARK_WARMUP_STEPS | Number of warmup steps before start of the benchmark | 500 |
| run mode parameters | --benchmark_steps BENCHMARK_STEPS | Number of steps for performance benchmark | 1000 |
| run mode parameters | --affinity {socket,single,single_unique,<br>socket_unique_interleaved,<br>socket_unique_continuous,disabled} | Type of CPU affinity | socket_unique_interleaved |
### Command-line options
To see the full list of available options and their descriptions, use the `-h` or `--help` command-line option:
```
python main.py -h
```
### Getting the data
The Outbrain dataset can be downloaded from [Kaggle](https://www.kaggle.com/c/outbrain-click-prediction/data) (requires Kaggle account).
#### Dataset guidelines
The dataset contains a sample of users' page views and clicks, as observed on multiple publisher sites. Viewed pages and clicked recommendations have additional semantic attributes of the documents. The dataset contains sets of content recommendations served to a specific user in a specific context. Each context (i.e., a set of recommended ads) is given a `display_id`. In each such recommendation set, the user has clicked on exactly one of the ads.
The original data is stored in several separate files:
* `page_views.csv` - log of users visiting documents (2B rows, ~100GB uncompressed)
* `clicks_train.csv` - data showing which ad was clicked in each recommendation set (87M rows)
* `clicks_test.csv` - used only for the submission in the original Kaggle contest
* `events.csv` - metadata about the context of each recommendation set (23M rows)
* `promoted_content.csv` - metadata about the ads
* `document_meta.csv`, `document_topics.csv`, `document_entities.csv`, `document_categories.csv` - metadata about the documents
During the preprocessing stage, the data is transformed into a tabular dataset of 87M rows and 26 features. The dataset is split into training and evaluation parts with approximately 60M and 27M rows, respectively. The split is made so that a random 80% of the daily events for the first 10 days of the dataset form the training set, and the remaining part (20% of the daily events for the first 10 days and all events in the last two days) forms the evaluation set. Finally, the dataset is saved in pre-batched TFRecord format.
#### Dataset preprocessing
Dataset preprocessing aims at creating a total of 26 features: 13 categorical and 13 numerical. These features are obtained from the original Outbrain dataset during preprocessing. There are 2 types of preprocessing available for the model:
* Spark CPU preprocessing
* [NVTabular](https://nvidia.github.io/NVTabular/v0.3.0/index.html) GPU preprocessing
Both split the dataset into train and evaluation sets and produce the same feature set; therefore, the training is agnostic to the preprocessing step.
For a comparison of Spark CPU and NVTabular GPU preprocessing, go to [NVTabular and Spark CPU Preprocessing comparison](#nvtabular-and-spark-cpu-preprocessing-comparison).
##### Spark CPU Dataset preprocessing
The original dataset is preprocessed using the scripts provided in `data/outbrain/spark`. Preprocessing is split into 3 preprocessing steps: `preproc1.py`, `preproc2.py`, and `preproc3.py` that form a complete workflow. The workflow consists of the following operations:
* separating out the validation set for cross-validation
* filling missing data with mode, median, or imputed values
* joining click data, ad metadata, and document category, topic and entity tables to create an enriched table
* computing 7 click-through rates (CTR) for ads grouped by 7 features
* computing attribute cosine similarity between the landing page and ad to be featured on the page
* math transformations of the numeric features (logarithmic, scaling, binning)
* categorifying data using hash-bucketing
* storing the resulting set of features in pre-batched TFRecord format
The `preproc1-3.py` preprocessing scripts use PySpark. In the Docker image, Spark 2.3.1 is installed as a standalone Spark cluster. The `preproc1.py` script splits the data into a training set and a validation set. The `preproc2.py` script computes the click-through rates (CTR) and cosine similarities between the features. The `preproc3.py` script performs the math transformations and generates the final TFRecord files. The data in the output files is pre-batched (with the default batch size of 4096) to avoid the overhead of the TFRecord format, which otherwise is not well suited to tabular data (a sketch of reading this pre-batched format is shown at the end of the [Dataset preprocessing](#dataset-preprocessing) section).
The preprocessing includes some very resource-intensive operations, including joins of tables having over 2 billion rows. Such operations may not fit into RAM, and therefore we use Spark, which is well suited to handling tabular operations on large data with limited RAM. Note that the Spark job requires about 500 GB of disk space and 300 GB of RAM to perform the preprocessing.
For more information about Spark, refer to the [Spark documentation](https://spark.apache.org/docs/2.3.1/).
##### NVTabular GPU preprocessing
With NVTabular, the dataset is preprocessed using the script provided in `data/outbrain/nvtabular`. The workflow consists of most of the same operations as the Spark pipeline:
* separating out the validation set for cross-validation
* filling missing data with the most frequent value
* joining the tables for the ad clicks data
* computing 7 click-through rates (CTR) for ads grouped by 7 different contexts
* computing attribute cosine similarity between the features of the clicked ads and the viewed ads
* math transformations of the numeric features (logarithmic, normalization)
* categorifying data using hash-bucketing
* storing the result in a Parquet format
* transforming the result into the pre-batched TFRecord format
Most of the code describing the operations in this workflow is in `data/outbrain/nvtabular/utils/workflow.py` and leverages NVTabular v0.3. As stated in its repository, [NVTabular](https://github.com/NVIDIA/NVTabular), a component of [NVIDIA Merlin Open Beta](https://developer.nvidia.com/nvidia-merlin), is a feature engineering and preprocessing library for tabular data that is designed to quickly and easily manipulate terabyte-scale datasets and train deep learning based recommender systems. It provides a high-level abstraction to simplify code and accelerates computation on the GPU using the [RAPIDS Dask-cuDF](https://github.com/rapidsai/cudf/tree/main/python/dask_cudf) library. The code to transform the NVTabular Parquet output into TFRecords is in `data/outbrain/nvtabular/utils/converter.py`.
The NVTabular version of preprocessing is not subject to the same memory and storage constraints as its Spark counterpart, since NVTabular is able to manipulate tables on the GPU and work with tables much larger than even physical RAM. The NVTabular Outbrain workflow has been successfully tested on DGX-1 V100 and DGX A100 for single- and multi-GPU preprocessing.
For more information about NVTabular, refer to the [NVTabular documentation](https://github.com/NVIDIA/NVTabular).
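Both preprocessing variants emit the same pre-batched TFRecord files (with the default pre-batch size of 4096 mentioned above). A minimal sketch of parsing such records with `tf.data`, using a hypothetical numeric feature name and file pattern rather than the repository's actual feature specification:

```
import tensorflow as tf

PREBATCH_SIZE = 4096  # default pre-batch size used during preprocessing

# Hypothetical feature spec: every tf.Example already holds a whole batch,
# so each feature is a fixed-length vector of PREBATCH_SIZE values.
feature_spec = {
    "feat_0": tf.io.FixedLenFeature([PREBATCH_SIZE], tf.float32),
    "label": tf.io.FixedLenFeature([PREBATCH_SIZE], tf.int64),
}

def parse_prebatched(serialized):
    example = tf.io.parse_single_example(serialized, feature_spec)
    return {"feat_0": example["feat_0"]}, example["label"]

files = tf.io.gfile.glob("/outbrain/tfrecords/train/part*")  # hypothetical pattern
dataset = (
    tf.data.TFRecordDataset(files)
    .map(parse_prebatched, num_parallel_calls=tf.data.experimental.AUTOTUNE)
    .prefetch(1)
)
```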
### Training process
The training can be started by running the `main.py` script. By default, the script is in train mode. Other training-related configs are also present in `trainer/utils/arguments.py` and can be seen using the command `python main.py -h`. Training uses the TFRecord training dataset files that match `--train_data_pattern`. Training runs for `--num_epochs` epochs with a global batch size of `--global_batch_size` in strong scaling mode (i.e., the effective batch size per GPU equals `global_batch_size/gpu_count`).
The model:
`tf.keras.experimental.WideDeepModel` consists of a wide part and a deep part with a sigmoid activation in the output layer (see [Figure 1](#model-architecture) for reference and `trainer/model/widedeep.py` for the model definition).
During training (default configuration), two separate optimizers are used to optimize the wide and the deep part of the network (a minimal sketch is shown at the end of this section):
* FTRL (Follow The Regularized Leader) optimizer is used to optimize the wide part of the network.
* RMSProp optimizer is used to optimize the deep part of the network.
Checkpoints of the model:
* can be loaded at the beginning of training when `--use_checkpoint` is set
* are saved into `--model_dir` after each training epoch (only the last checkpoint is kept)
* contain information about the number of completed training epochs
The model is evaluated on the evaluation dataset after every training epoch. The training log is displayed in the console and stored in `--log_filename`:
* every 100 batches, the training metrics are logged: loss, binary accuracy, AUC ROC, and MAP@12
* after every training epoch, the evaluation metrics are logged: loss, binary accuracy, AUC ROC, and MAP@12
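A minimal sketch of the model and its two optimizers (sizes and learning rates taken from the defaults above; input handling and embeddings omitted; see `trainer/model/widedeep.py` for the actual definition):

```
import tensorflow as tf

# Wide (linear) part and deep part: 5 hidden layers of 1024 units with dropout 0.1.
wide = tf.keras.experimental.LinearModel()
deep_layers = []
for _ in range(5):
    deep_layers += [tf.keras.layers.Dense(1024, activation="relu"),
                    tf.keras.layers.Dropout(0.1)]
deep = tf.keras.Sequential(deep_layers + [tf.keras.layers.Dense(1)])

# WideDeepModel sums both outputs and applies the sigmoid activation.
model = tf.keras.experimental.WideDeepModel(wide, deep, activation="sigmoid")

model.compile(
    # The first optimizer trains the wide part, the second the deep part.
    optimizer=[tf.keras.optimizers.Ftrl(0.02), tf.keras.optimizers.RMSprop(0.00012)],
    loss="binary_crossentropy",
    metrics=[tf.keras.metrics.AUC(), tf.keras.metrics.BinaryAccuracy()],
)
```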
### Evaluation process
The evaluation can be started by running the `main.py --evaluate` script. Evaluation is done on the TFRecord dataset specified by `--eval_data_pattern`. Other evaluation-related configs are also present in `trainer/utils/arguments.py` and can be seen using the command `python main.py -h`.
During evaluation (with the `--evaluate` flag):
* the model is restored from the checkpoint in `--model_dir` if `--use_checkpoint` is set
* the evaluation log is displayed in the console and stored in `--log_filename`
* every 100 batches, the evaluation metrics are logged: loss, binary accuracy, AUC ROC, and MAP@12
After the whole evaluation, the total evaluation metrics are logged: loss, binary accuracy, AUC ROC, and MAP@12.
## Performance
### Benchmarking
The following section shows how to run benchmarks measuring the model performance in training and evaluation modes.
#### NVTabular and Spark CPU Preprocessing comparison
Two types of dataset preprocessing are provided: Spark on CPU and NVTabular on GPU. Both of these pipelines return pre-batched TFRecord files with the same structure. The following table compares the two in terms of code complexity (lines of code), top RAM consumption, and preprocessing time.
| | Spark on NVIDIA DGX-1 (CPU) | Spark on NVIDIA DGX A100 (CPU) | NVTabular on DGX-1 1 GPU | NVTabular on DGX-1 8 GPU | NVTabular on DGX A100 1 GPU | NVTabular on DGX A100 8 GPU |
| -------------------------- | ----- | ----- | ----- | ----- | ----- | ----- |
| Lines of code* | ~1500 | ~1500 | ~500 | ~500 | ~500 | ~500 |
| Top RAM consumption \[GB\] | 167.0 | 223.4 | 34.3 | 48.7 | 37.7 | 50.6 |
| Top VRAM consumption per GPU \[GB\] | 0 | 0 | 16 | 13 | 45 | 67 |
| Preprocessing time \[min\] | 45.6 | 38.5 | 4.4 | 3.9 | 2.6 | 2.3 |
To achieve the same results for Top RAM consumption and preprocessing time, run a preprocessing container (`${HOST_OUTBRAIN_PATH}` is the path with Outbrain dataset).
```
nvidia-docker run --rm -it --ipc=host -v ${HOST_OUTBRAIN_PATH}:/outbrain wd2-prep bash
```
In the preprocessing container, run the preprocessing benchmark.
For Spark CPU preprocessing:
```
cd /wd && bash scripts/preproc_benchmark.sh -m spark
```
For GPU NVTabular preprocessing:
```
cd /wd && bash scripts/preproc_benchmark.sh -m nvtabular
```
#### Training and evaluation performance benchmark
The benchmark script measures the performance of the model during training (default configuration) and evaluation (`--evaluate`). The benchmark runs training or evaluation for `--benchmark_steps` batches; however, the performance measurement starts only after `--benchmark_warmup_steps`. The benchmark can be run on 1 or 8 GPUs and with any combination of XLA (`--xla`), AMP (`--amp`), batch sizes (`--global_batch_size`, `--eval_batch_size`), and affinity (`--affinity`). A conceptual sketch of the measurement is shown after the commands below.
In order to run the benchmark, follow these steps:
Run the training container (`${HOST_OUTBRAIN_PATH}` is the path to the Outbrain dataset):
```
nvidia-docker run --rm -it --ipc=host --privileged -v ${HOST_OUTBRAIN_PATH}:/outbrain wd2-train bash
```
Run the benchmark script:
For 1 GPU:
```
python main.py --benchmark
```
The benchmark will be run for training with default training parameters.
For 8 GPUs:
```
mpiexec --allow-run-as-root --bind-to socket -np 8 python main.py --benchmark
```
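Conceptually, the benchmark mode performs the following measurement (a sketch under assumptions, not the actual implementation): it runs `--benchmark_warmup_steps` unmeasured steps, then times `--benchmark_steps` steps and reports samples per second:

```
import time

def measure_throughput(step_fn, batch_size, warmup_steps=500, benchmark_steps=1000):
    # Warmup steps are executed but not timed.
    for _ in range(warmup_steps):
        step_fn()
    start = time.perf_counter()
    for _ in range(benchmark_steps):
        step_fn()
    elapsed = time.perf_counter() - start
    return benchmark_steps * batch_size / elapsed   # samples per second
```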
### Results
The following sections provide details on how we achieved our performance and accuracy in training.
#### Training accuracy results
##### Training accuracy: NVIDIA DGX A100 (8x A100 80GB)
Our results were obtained by running the `main.py` training script in the TensorFlow2 NGC container on NVIDIA DGX A100 with (8x A100 80GB) GPUs.
| GPUs | Batch size / GPU | XLA | Accuracy - TF32 (MAP@12), Spark dataset | Accuracy - mixed precision (MAP@12), Spark dataset | Accuracy - TF32 (MAP@12), NVTabular dataset | Accuracy - mixed precision (MAP@12), NVTabular dataset | Time to train - TF32 (minutes) | Time to train - mixed precision (minutes) | Time to train speedup (TF32 to mixed precision) |
| ---- | ---------------- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 131072 | Yes | 0.65536 | 0.65537 | 0.65537 | 0.65646 | 16.40 | 13.71 | 1.20 |
| 1 | 131072 | No | 0.65538 | 0.65533 | 0.65533 | 0.65643 | 19.58 | 18.49 | 1.06 |
| 8 | 16384 | Yes | 0.65527 | 0.65525 | 0.65525 | 0.65638 | 7.77 | 9.71 | 0.80 |
| 8 | 16384 | No | 0.65517 | 0.65525 | 0.65525 | 0.65638 | 7.84 | 9.48 | 0.83 |
To achieve the same results, follow the steps in the [Quick Start Guide](#quick-start-guide).
##### Training accuracy: NVIDIA DGX-1 (8x V100 16GB)
Our results were obtained by running the `main.py` training script in the TensorFlow2 NGC container on NVIDIA DGX-1 with (8x V100 16GB) GPUs.
| GPUs | Batch size / GPU | XLA | Accuracy - FP32 (MAP@12), Spark dataset | Accuracy - mixed precision (MAP@12), Spark dataset | Accuracy - FP32 (MAP@12), NVTabular dataset | Accuracy - mixed precision (MAP@12), NVTabular dataset | Time to train - FP32 (minutes) | Time to train - mixed precision (minutes) | Time to train speedup (FP32 to mixed precision) |
| ---- | ---------------- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 131072 | Yes | 0.65531 | 0.65529 | 0.65529 | 0.65651 | 66.01 | 23.66 | 2.79 |
| 1 | 131072 | No | 0.65542 | 0.65534 | 0.65534 | 0.65641 | 72.68 | 29.18 | 2.49 |
| 8 | 16384 | Yes | 0.65544 | 0.65547 | 0.65547 | 0.65642 | 16.28 | 13.90 | 1.17 |
| 8 | 16384 | No | 0.65548 | 0.65540 | 0.65540 | 0.65633 | 16.34 | 12.65 | 1.29 |
To achieve the same results, follow the steps in the [Quick Start Guide](#quick-start-guide).
##### Training accuracy plots
Models trained with FP32, TF32, and Automatic Mixed Precision (AMP), with and without XLA enabled, achieve similar accuracy.
The plots show MAP@12 as a function of training steps (a step is a single batch) for the default precision (FP32 for the Volta architecture (DGX-1) and TF32 for the Ampere GPU architecture (DGX A100)) and for AMP, with and without XLA, for both datasets. All other training parameters are at their defaults.
<p align="center">
<img width="100%" src="./img/leraning_curve_spark.svg" />
<br>
Figure 2. Learning curves for Spark dataset for different configurations.</a>
</p>
<p align="center">
<img width="100%" src="./img/learning_curve_nvt.svg" />
<br>
Figure 3. Learning curves for NVTabular dataset for different configurations.</a>
</p>
##### Training stability test
Training of the model is stable for multiple configurations, achieving a standard deviation of MAP@12 on the order of 10e-4. The model achieves similar MAP@12 scores across A100 and V100, training precisions, XLA usage, and single/multi-GPU setups. The Wide & Deep model was trained for 9100 training steps (20 epochs, 455 batches per epoch, each batch containing 131072 samples), starting from 20 different initial random seeds for each setup. The training was performed in the 20.12-tf2-py3 NGC container on NVIDIA DGX A100 80GB and DGX-1 16GB machines, with and without mixed precision enabled and with and without XLA enabled, for the Spark- and NVTabular-generated datasets. The provided charts and numbers consider single- and 8-GPU training. After training, the models were evaluated on the validation set. The following plots compare distributions of MAP@12 on the evaluation set; columns correspond to single vs. 8-GPU training, rows to DGX A100 and DGX-1 V100.
<p align="center">
<img width="100%" src="./img/training_stability_spark.svg" />
<br>
Figure 4. Training stability for Spark dataset: distribution of MAP@12 across different configurations. 'All configurations' refer to the distribution of MAP@12 for cartesian product of architecture, training precision, XLA usage, single/multi GPU. </a>
</p>
<p align="center">
<img width="100%" src="./img/training_stability_nvtabular.svg" />
<br>
Figure 5. Training stability for NVtabular dataset: distribution of MAP@12 across different configurations. 'All configurations' refer to the distribution of MAP@12 for cartesian product of architecture, training precision, XLA usage, single/multi GPU.</a>
</p>
Training stability was also compared in terms of point statistics for MAP@12 distribution for multiple configurations. Refer to the expandable table below.
<details>
<summary>Full tabular data for training stability tests</summary>
| Platform | GPUs | Precision | Dataset | XLA | Mean | Std | Min | Max |
|--------|-|---------|-----------|---|----|---|---|---
DGX A100|1|TF32|Spark preprocessed|Yes|0.65536|0.00016|0.65510|0.65560|
DGX A100|1|TF32|Spark preprocessed|No|0.65538|0.00013|0.65510|0.65570|
DGX A100|1|TF32|NVTabular preprocessed|Yes|0.65641|0.00038|0.65530|0.65680|
DGX A100|1|TF32|NVTabular preprocessed|No|0.65648|0.00024|0.65580|0.65690|
DGX A100|1|AMP|Spark preprocessed|Yes|0.65537|0.00013|0.65510|0.65550|
DGX A100|1|AMP|Spark preprocessed|No|0.65533|0.00016|0.65500|0.65550|
DGX A100|1|AMP|NVTabular preprocessed|Yes|0.65646|0.00036|0.65530|0.65690|
DGX A100|1|AMP|NVTabular preprocessed|No|0.65643|0.00027|0.65590|0.65690|
DGX A100|8|TF32|Spark preprocessed|Yes|0.65527|0.00013|0.65500|0.65560|
DGX A100|8|TF32|Spark preprocessed|No|0.65517|0.00025|0.65460|0.65560|
DGX A100|8|TF32|NVTabular preprocessed|Yes|0.65631|0.00038|0.65550|0.65690|
DGX A100|8|TF32|NVTabular preprocessed|No|0.65642|0.00022|0.65570|0.65680|
DGX A100|8|AMP|Spark preprocessed|Yes|0.65525|0.00018|0.65490|0.65550|
DGX A100|8|AMP|Spark preprocessed|No|0.65525|0.00016|0.65490|0.65550|
DGX A100|8|AMP|NVTabular preprocessed|Yes|0.65638|0.00026|0.65580|0.65680|
DGX A100|8|AMP|NVTabular preprocessed|No|0.65638|0.00031|0.65560|0.65700|
DGX-1 V100|1|FP32|Spark preprocessed|Yes|0.65531|0.00017|0.65490|0.65560|
DGX-1 V100|1|FP32|Spark preprocessed|No|0.65542|0.00012|0.65520|0.65560|
DGX-1 V100|1|FP32|NVTabular preprocessed|Yes|0.65651|0.00019|0.65610|0.65680|
DGX-1 V100|1|FP32|NVTabular preprocessed|No|0.65638|0.00035|0.65560|0.65680|
DGX-1 V100|1|AMP|Spark preprocessed|Yes|0.65529|0.00015|0.65500|0.65570|
DGX-1 V100|1|AMP|Spark preprocessed|No|0.65534|0.00015|0.65500|0.65560|
DGX-1 V100|1|AMP|NVTabular preprocessed|Yes|0.65651|0.00028|0.65560|0.65690|
DGX-1 V100|1|AMP|NVTabular preprocessed|No|0.65641|0.00032|0.65570|0.65680|
DGX-1 V100|8|FP32|Spark preprocessed|Yes|0.65544|0.00019|0.65500|0.65580|
DGX-1 V100|8|FP32|Spark preprocessed|No|0.65548|0.00013|0.65510|0.65560|
DGX-1 V100|8|FP32|NVTabular preprocessed|Yes|0.65645|0.00012|0.65630|0.65670|
DGX-1 V100|8|FP32|NVTabular preprocessed|No|0.65638|0.00015|0.65610|0.65670|
DGX-1 V100|8|AMP|Spark preprocessed|Yes|0.65547|0.00015|0.65520|0.65580|
DGX-1 V100|8|AMP|Spark preprocessed|No|0.65540|0.00019|0.65500|0.65580|
DGX-1 V100|8|AMP|NVTabular preprocessed|Yes|0.65642|0.00028|0.65580|0.65690|
DGX-1 V100|8|AMP|NVTabular preprocessed|No|0.65633|0.00037|0.65510|0.65680|
</details>
##### Impact of mixed precision on training accuracy
The accuracy of training, measured with [MAP@12](https://en.wikipedia.org/wiki/Evaluation_measures_(information_retrieval)#Mean_average_precision) on the evaluation set after the final epoch, was not impacted by enabling mixed precision. The obtained results were statistically similar. The similarity was measured according to the following procedure:
The model was trained 20 times with default settings (FP32 or TF32 for the Volta and Ampere architectures, respectively) and 20 times with AMP. After the last epoch, the MAP@12 accuracy score was calculated on the evaluation set.
Distributions for four configurations (architecture: A100, V100; single/multi-GPU) for the 2 datasets are presented below.
<p align="center">
<img width="100%" src="./img/amp_influence_spark.svg" />
<br>
Figure 6. Influence of AMP on MAP@12 distribution for DGX A100 and DGX-1 V100 for single and multi gpu training on Spark dataset. </a>
</p>
<p align="center">
<img width="100%" src="./img/amp_influence_nvtabular.svg" />
<br>
Figure 7. Influence of AMP on MAP@12 distribution for DGX A100 and DGX-1 V100 for single and multi gpu training on NVTabular dataset.
</p>
MAP@12 distributions for full precision and AMP training were compared in terms of mean, variance, and the [Kolmogorov-Smirnov test](https://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test) to assess the statistical difference between full precision and AMP results (a sketch of the test is shown below); refer to the expandable table for the full data.
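A sketch of this comparison for one configuration (placeholder MAP@12 values; in practice the 20 scores collected per precision are used), using the two-sample Kolmogorov-Smirnov test from SciPy:

```
import numpy as np
from scipy.stats import ks_2samp

# Placeholder samples; in practice these are the 20 MAP@12 scores per precision.
full_precision = np.array([0.65538, 0.65541, 0.65529, 0.65536])
amp = np.array([0.65533, 0.65536, 0.65527, 0.65534])

statistic, p_value = ks_2samp(full_precision, amp)
print(f"mean difference: {full_precision.mean() - amp.mean():+.5f}")
print(f"KS statistic: {statistic:.3f}, p-value: {p_value:.3f}")
# A large p-value means the hypothesis that both samples come from the same
# distribution cannot be rejected.
```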
<details>
<summary>Full tabular data for AMP influence on MAP@12</summary>
| Platform | GPUs | Dataset | XLA | Mean MAP@12 for full precision (TF32 for A100, FP32 for V100) | Std MAP@12 for full precision (TF32 for A100, FP32 for V100) | Mean MAP@12 for AMP | Std MAP@12 for AMP | KS test value: statistic (p-value) |
| ------------ | ---- | ---------------------- | --- | ------ | ------ | ------ | ------ | ---------------- |
| DGX A100 | 1 | NVTabular preprocessed | No | 0.6565 | 0.0002 | 0.6564 | 0.0003 | 0.2000 (0.8320) |
| DGX A100 | 8 | NVTabular preprocessed | No | 0.6564 | 0.0002 | 0.6564 | 0.0003 | 0.1500 (0.9831) |
| DGX A100 | 1 | Spark preprocessed | No | 0.6554 | 0.0001 | 0.6553 | 0.0002 | 0.2500 (0.5713) |
| DGX A100 | 8 | Spark preprocessed | No | 0.6552 | 0.0002 | 0.6552 | 0.0002 | 0.3000 (0.3356) |
| DGX A100 | 1 | NVTabular preprocessed | No | 0.6564 | 0.0004 | 0.6565 | 0.0004 | 0.1500 (0.9831) |
| DGX A100 | 8 | NVTabular preprocessed | No | 0.6563 | 0.0004 | 0.6564 | 0.0003 | 0.2500 (0.5713) |
| DGX A100 | 1 | Spark preprocessed | No | 0.6554 | 0.0002 | 0.6554 | 0.0001 | 0.1500 (0.9831) |
| DGX A100 | 8 | Spark preprocessed | No | 0.6553 | 0.0001 | 0.6552 | 0.0002 | 0.1500 (0.9831) |
| DGX-1 V100 | 1 | NVTabular preprocessed | No | 0.6564 | 0.0004 | 0.6564 | 0.0003 | 0.1000 (1.0000) |
| DGX-1 V100 | 8 | NVTabular preprocessed | No | 0.6564 | 0.0001 | 0.6563 | 0.0004 | 0.2500 (0.5713) |
| DGX-1 V100 | 1 | Spark preprocessed | No | 0.6554 | 0.0001 | 0.6553 | 0.0001 | 0.2000 (0.8320) |
| DGX-1 V100 | 8 | Spark preprocessed | No | 0.6555 | 0.0001 | 0.6554 | 0.0002 | 0.3500 (0.1745) |
| DGX-1 V100 | 1 | NVTabular preprocessed | No | 0.6565 | 0.0002 | 0.6565 | 0.0003 | 0.1500 (0.9831) |
| DGX-1 V100 | 8 | NVTabular preprocessed | No | 0.6564 | 0.0001 | 0.6564 | 0.0003 | 0.2000 (0.8320) |
| DGX-1 V100 | 1 | Spark preprocessed | No | 0.6553 | 0.0002 | 0.6553 | 0.0002 | 0.2000 (0.8320) |
| DGX-1 V100 | 8 | Spark preprocessed | No | 0.6554 | 0.0002 | 0.6555 | 0.0002 | 0.1500 (0.9831) |
</details>
#### Training performance results
##### Training performance: NVIDIA DGX A100 (8x A100 80GB)
Our results were obtained by running the benchmark script (`main.py --benchmark`) in the TensorFlow2 NGC container on NVIDIA DGX A100 with (8x A100 80GB) GPUs.
|GPUs | Batch size / GPU | XLA | Throughput - TF32 (samples/s)|Throughput - mixed precision (samples/s)|Throughput speedup (TF32 - mixed precision)| Strong scaling - TF32|Strong scaling - mixed precision
| ---- | ---------------- | --- | ----------------------------- | ---------------------------------------- | ------------------------------------------- | --------------------- | -------------------------------- |
|1|131,072|Yes|1642892|1997414|1.22|1.00|1.00|
|1|131,072|No|1269638|1355523|1.07|1.00|1.00|
|8|16,384|Yes|3376438|2508278|0.74|2.06|1.26|
|8|16,384|No|3351118|2643009|0.79|2.64|1.07|
##### Training performance: NVIDIA DGX-1 (8x V100 16GB)
Our results were obtained by running the benchmark script (`main.py --benchmark`) in the TensorFlow2 NGC container on NVIDIA DGX-1 with (8x V100 16GB) GPUs.
|GPUs | Batch size / GPU | XLA | Throughput - FP32 (samples/s)|Throughput - mixed precision (samples/s)|Throughput speedup (FP32 - mixed precision)| Strong scaling - FP32|Strong scaling - mixed precision
| ---- | ---------------- | --- | ----------------------------- | ---------------------------------------- | ------------------------------------------- | --------------------- | -------------------------------- |
|1|131,072|Yes|361202|1091584|3.02|1.00|1.00
|1|131,072|No|321816|847229|2.63|1.00|1.00
|8|16,384|Yes|1512691|1731391|1.14|4.19|1.59
|8|16,384|No|1490044|1837962|1.23|4.63|2.17
#### Evaluation performance results
##### Evaluation performance: NVIDIA DGX A100 (8x A100 80GB)
Our results were obtained by running the benchmark script (`main.py --evaluate --benchmark`) in the TensorFlow2 NGC container on NVIDIA DGX A100 with 8x A100 80GB GPUs.
|GPUs|Batch size / GPU|XLA|Throughput \[samples/s\] TF32|Throughput \[samples/s\] AMP|Throughput speedup AMP to TF32
|----|----------------|---|------------------------------|-----------------------------|-------------------------------
|1|4096|NO|648058|614053|0.95|
|1|8192|NO|1063986|1063203|1.00|
|1|16384|NO|1506679|1573248|1.04|
|1|32768|NO|1983238|2088212|1.05|
|1|65536|NO|2280630|2523812|1.11|
|1|131072|NO|2568911|2915340|1.13|
|8|4096|NO|4516588|4374181|0.97|
|8|8192|NO|7715609|7718173|1.00|
|8|16384|NO|11296845|11624159|1.03|
|8|32768|NO|14957242|15904745|1.06|
|8|65536|NO|17671055|19332987|1.09|
|8|131072|NO|19779711|21761656|1.10|
For more results, refer to the expandable table below.
<details>
<summary>Full tabular data for evaluation performance results for DGX A100</summary>
|GPUs|Batch size / GPU|XLA|Throughput \[samples/s\] TF32|Throughput \[samples/s\] AMP|Throughput speedup AMP to TF32
|----|----------------|---|------------------------------|-----------------------------|-------------------------------
|1|4096|YES|621024|648441|1.04|
|1|4096|NO|648058|614053|0.95|
|1|8192|YES|1068943|1045790|0.98|
|1|8192|NO|1063986|1063203|1.00|
|1|16384|YES|1554101|1710186|1.10|
|1|16384|NO|1506679|1573248|1.04|
|1|32768|YES|2014216|2363490|1.17|
|1|32768|NO|1983238|2088212|1.05|
|1|65536|YES|2010050|2450872|1.22|
|1|65536|NO|2280630|2523812|1.11|
|1|131072|YES|2321543|2885393|1.24|
|1|131072|NO|2568911|2915340|1.13|
|8|4096|YES|4328154|4445315|1.03|
|8|4096|NO|4516588|4374181|0.97|
|8|8192|YES|7410554|7640191|1.03|
|8|8192|NO|7715609|7718173|1.00|
|8|16384|YES|11412928|12422567|1.09|
|8|16384|NO|11296845|11624159|1.03|
|8|32768|YES|11428369|12525670|1.10|
|8|32768|NO|14957242|15904745|1.06|
|8|65536|YES|13453756|15308455|1.14|
|8|65536|NO|17671055|19332987|1.09|
|8|131072|YES|17047482|20930042|1.23|
|8|131072|NO|19779711|21761656|1.10|
</details>
##### Evaluation performance: NVIDIA DGX-1 (8x V100 16GB)
Our results were obtained by running the benchmark script (`main.py --evaluate --benchmark`) in the TensorFlow2 NGC container on NVIDIA DGX-1 with (8x V100 16GB) GPUs.
|GPUs|Batch size / GPU|XLA|Throughput \[samples/s\] FP32|Throughput \[samples/s\] AMP|Throughput speedup AMP to FP32
|----|----------------|---|------------------------------|-----------------------------|-------------------------------
|1|4096|NO|375928|439395|1.17|
|1|8192|NO|526780|754517|1.43|
|1|16384|NO|673971|1133696|1.68|
|1|32768|NO|791637|1470221|1.86|
|1|65536|NO|842831|1753500|2.08|
|1|131072|NO|892941|1990898|2.23|
|8|4096|NO|2893390|3278473|1.13|
|8|8192|NO|3881996|5337866|1.38|
|8|16384|NO|5003135|8086178|1.62|
|8|32768|NO|6124648|11087247|1.81|
|8|65536|NO|6631887|13233484|2.00|
|8|131072|NO|7030438|15081861|2.15|
For more results, refer to the expandable table below.
<details>
<summary>Full tabular data for evaluation performance for DGX-1 V100 results</summary>
|GPUs|Batch size / GPU|XLA|Throughput \[samples/s\] FP32|Throughput \[samples/s\] AMP|Throughput speedup AMP to FP32
|----|----------------|---|------------------------------|-----------------------------|-------------------------------
|1|4096|YES|356963|459481|1.29|
|1|4096|NO|375928|439395|1.17|
|1|8192|YES|517016|734515|1.42|
|1|8192|NO|526780|754517|1.43|
|1|16384|YES|660772|1150292|1.74|
|1|16384|NO|673971|1133696|1.68|
|1|32768|YES|776357|1541699|1.99|
|1|32768|NO|791637|1470221|1.86|
|1|65536|YES|863311|1962275|2.27|
|1|65536|NO|842831|1753500|2.08|
|1|131072|YES|928290|2235968|2.41|
|1|131072|NO|892941|1990898|2.23|
|8|4096|YES|2680961|3182591|1.19|
|8|4096|NO|2893390|3278473|1.13|
|8|8192|YES|3738172|5185972|1.39|
|8|8192|NO|3881996|5337866|1.38|
|8|16384|YES|4961435|8170489|1.65|
|8|16384|NO|5003135|8086178|1.62|
|8|32768|YES|6218767|11658218|1.87|
|8|32768|NO|6124648|11087247|1.81|
|8|65536|YES|6808677|14921211|2.19|
|8|65536|NO|6631887|13233484|2.00|
|8|131072|YES|7205370|16923294|2.35|
|8|131072|NO|7030438|15081861|2.15|
</details>
## Release notes
### Changelog
February 2021
Initial release
### Known issues
* In this model, TF32 precision can in some cases be as fast as FP16 precision on Ampere GPUs. This is because TF32 also uses Tensor Cores and does not need any additional logic such as maintaining FP32 master weights and casts. However, note that Wide & Deep is, by modern recommender standards, a very small model. Larger models should still see significant benefits from FP16 math.

View file

@@ -0,0 +1,139 @@
# Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from functools import partial
from multiprocessing import cpu_count
import tensorflow as tf
from data.outbrain.features import get_features_keys
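# The TFRecords produced by both preprocessing pipelines are prebatched: each
# tf.train.Example already holds PREBATCH_SIZE rows per feature (see features.py).
# After batching several such records, every tensor has an extra prebatch dimension;
# the helper below flattens the first two dimensions so the model receives plain
# per-sample batches, and drops any keys that are not model features.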
def _consolidate_batch(elem):
label = elem.pop('label')
reshaped_label = tf.reshape(label, [-1, label.shape[-1]])
features = get_features_keys()
reshaped_elem = {
key: tf.reshape(elem[key], [-1, elem[key].shape[-1]])
for key in elem
if key in features
}
return reshaped_elem, reshaped_label
def get_parse_function(feature_spec):
def _parse_function(example_proto):
return tf.io.parse_single_example(example_proto, feature_spec)
return _parse_function
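# Training input pipeline: list the TFRecord files matching the pattern, interleave
# reads across CPU workers, parse the prebatched examples, shard them across GPUs,
# shuffle, repeat indefinitely, batch, flatten the prebatch dimension with
# _consolidate_batch, and prefetch to overlap input processing with training.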
def train_input_fn(
filepath_pattern,
feature_spec,
records_batch_size,
num_gpus=1,
id=0):
_parse_function = get_parse_function(feature_spec)
dataset = tf.data.Dataset.list_files(
file_pattern=filepath_pattern
)
dataset = dataset.interleave(
lambda x: tf.data.TFRecordDataset(x),
cycle_length=cpu_count() // num_gpus,
block_length=1
)
dataset = dataset.map(
map_func=_parse_function,
num_parallel_calls=tf.data.experimental.AUTOTUNE
)
dataset = dataset.shard(num_gpus, id)
dataset = dataset.shuffle(records_batch_size * 8)
dataset = dataset.repeat(
count=None
)
dataset = dataset.batch(
batch_size=records_batch_size,
drop_remainder=False
)
dataset = dataset.map(
map_func=partial(
_consolidate_batch
),
num_parallel_calls=tf.data.experimental.AUTOTUNE
)
dataset = dataset.prefetch(
buffer_size=tf.data.experimental.AUTOTUNE
)
return dataset
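# Evaluation input pipeline: files are listed in a fixed order (shuffle=False) and
# read sequentially, records are sharded per GPU, batched and only then parsed with
# parse_example_dataset, so each worker evaluates a deterministic, non-overlapping
# part of the validation set.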
def eval_input_fn(
filepath_pattern,
feature_spec,
records_batch_size,
num_gpus=1,
repeat=1,
id=0):
dataset = tf.data.Dataset.list_files(
file_pattern=filepath_pattern,
shuffle=False
)
dataset = tf.data.TFRecordDataset(
filenames=dataset,
num_parallel_reads=1
)
dataset = dataset.shard(num_gpus, id)
dataset = dataset.repeat(
count=repeat
)
dataset = dataset.batch(
batch_size=records_batch_size,
drop_remainder=False
)
dataset = dataset.apply(
transformation_func=tf.data.experimental.parse_example_dataset(
features=feature_spec,
num_parallel_calls=1
)
)
dataset = dataset.map(
map_func=partial(
_consolidate_batch
),
num_parallel_calls=None
)
dataset = dataset.prefetch(
buffer_size=1
)
return dataset

View file

@@ -0,0 +1,131 @@
# Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import logging
import tensorflow as tf
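# Number of samples packed into a single tf.train.Example by the preprocessing
# pipelines; the dataloader later flattens this prebatch dimension.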
PREBATCH_SIZE = 4096
DISPLAY_ID_COLUMN = 'display_id'
TIME_COLUMNS = [
'doc_event_days_since_published_log_01scaled',
'doc_ad_days_since_published_log_01scaled'
]
GB_COLUMNS = [
'pop_document_id',
'pop_publisher_id',
'pop_source_id',
'pop_ad_id',
'pop_advertiser_id',
'pop_campain_id',
'doc_views_log_01scaled',
'ad_views_log_01scaled'
]
SIM_COLUMNS = [
'doc_event_doc_ad_sim_categories',
'doc_event_doc_ad_sim_topics',
'doc_event_doc_ad_sim_entities'
]
NUMERIC_COLUMNS = TIME_COLUMNS + SIM_COLUMNS + GB_COLUMNS
CATEGORICAL_COLUMNS = [
'ad_id',
'campaign_id',
'doc_event_id',
'event_platform',
'doc_id',
'ad_advertiser',
'doc_event_source_id',
'doc_event_publisher_id',
'doc_ad_source_id',
'doc_ad_publisher_id',
'event_geo_location',
'event_country',
'event_country_state',
]
HASH_BUCKET_SIZES = {
'doc_event_id': 300000,
'ad_id': 250000,
'doc_id': 100000,
'doc_ad_source_id': 4000,
'doc_event_source_id': 4000,
'event_geo_location': 2500,
'ad_advertiser': 2500,
'event_country_state': 2000,
'doc_ad_publisher_id': 1000,
'doc_event_publisher_id': 1000,
'event_country': 300,
'event_platform': 4,
'campaign_id': 5000
}
EMBEDDING_DIMENSIONS = {
'doc_event_id': 128,
'ad_id': 128,
'doc_id': 128,
'doc_ad_source_id': 64,
'doc_event_source_id': 64,
'event_geo_location': 64,
'ad_advertiser': 64,
'event_country_state': 64,
'doc_ad_publisher_id': 64,
'doc_event_publisher_id': 64,
'event_country': 64,
'event_platform': 16,
'campaign_id': 128
}
EMBEDDING_TABLE_SHAPES = {
column: (HASH_BUCKET_SIZES[column], EMBEDDING_DIMENSIONS[column]) for column in CATEGORICAL_COLUMNS
}
def get_features_keys():
return CATEGORICAL_COLUMNS + NUMERIC_COLUMNS + [DISPLAY_ID_COLUMN]
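# Builds the columns for the Wide & Deep model: each categorical feature becomes an
# identity column (wide part) wrapped in an embedding column (deep part), using the
# hash bucket sizes and embedding dimensions defined above, while the numeric
# features are shared by both parts.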
def get_feature_columns():
logger = logging.getLogger('tensorflow')
wide_columns, deep_columns = [], []
for column_name in CATEGORICAL_COLUMNS:
if column_name in EMBEDDING_TABLE_SHAPES:
categorical_column = tf.feature_column.categorical_column_with_identity(
column_name, num_buckets=EMBEDDING_TABLE_SHAPES[column_name][0])
wrapped_column = tf.feature_column.embedding_column(
categorical_column,
dimension=EMBEDDING_TABLE_SHAPES[column_name][1],
combiner='mean')
else:
raise ValueError(f'Unexpected categorical column found {column_name}')
wide_columns.append(categorical_column)
deep_columns.append(wrapped_column)
numerics = [tf.feature_column.numeric_column(column_name, shape=(1,), dtype=tf.float32)
for column_name in NUMERIC_COLUMNS]
wide_columns.extend(numerics)
deep_columns.extend(numerics)
logger.warning('deep columns: {}'.format(len(deep_columns)))
logger.warning('wide columns: {}'.format(len(wide_columns)))
logger.warning('wide&deep intersection: {}'.format(len(set(wide_columns).intersection(set(deep_columns)))))
return wide_columns, deep_columns

View file

@@ -0,0 +1,51 @@
# Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import logging
import os
os.environ['TF_MEMORY_ALLOCATION'] = "0.0"
from data.outbrain.nvtabular.utils.converter import nvt_to_tfrecords
from data.outbrain.nvtabular.utils.workflow import execute_pipeline
from data.outbrain.nvtabular.utils.arguments import parse_args
from data.outbrain.nvtabular.utils.setup import create_config
def is_empty(path):
return not os.path.exists(path) or (not os.path.isfile(path) and not os.listdir(path))
def main():
args = parse_args()
config = create_config(args)
if is_empty(args.metadata_path):
logging.warning('Creating new stats data into {}'.format(config['stats_file']))
execute_pipeline(config)
else:
logging.warning(f'Directory is not empty {args.metadata_path}')
logging.warning('Skipping NVTabular preprocessing')
if os.path.exists(config['output_train_folder']) and os.path.exists(config['output_valid_folder']):
if is_empty(config['tfrecords_path']):
logging.warning('Executing NVTabular parquets to TFRecords conversion')
nvt_to_tfrecords(config)
else:
logging.warning(f"Directory is not empty {config['tfrecords_path']}")
logging.warning('Skipping TFrecords conversion')
else:
logging.warning(f'Train and validation dataset not found in {args.metadata_path}')
if __name__ == '__main__':
main()

View file

@@ -0,0 +1,52 @@
# Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
DEFAULT_DIR = '/outbrain'
def parse_args():
parser = argparse.ArgumentParser()
parser.add_argument(
'--data_path',
help='Path with the data required for NVTabular preprocessing. '
'If stats already exists under metadata_path preprocessing phase will be skipped.',
type=str,
default=f'{DEFAULT_DIR}/orig',
nargs='+'
)
parser.add_argument(
'--metadata_path',
help='Path with preprocessed NVTabular stats',
type=str,
default=f'{DEFAULT_DIR}/data',
nargs='+'
)
parser.add_argument(
'--tfrecords_path',
help='Path where converted tfrecords will be stored',
type=str,
default=f'{DEFAULT_DIR}/tfrecords',
nargs='+'
)
parser.add_argument(
'--workers',
help='Number of TfRecords files to be created',
type=int,
default=40
)
return parser.parse_args()

View file

@@ -0,0 +1,158 @@
# Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import logging
import os
from multiprocessing import Process
import pandas as pd
import tensorflow as tf
from tensorflow_transform.tf_metadata import dataset_metadata
from tensorflow_transform.tf_metadata import dataset_schema
from tensorflow_transform.tf_metadata import metadata_io
from data.outbrain.features import PREBATCH_SIZE
from data.outbrain.nvtabular.utils.feature_description import transform_nvt_to_spark, CATEGORICAL_COLUMNS, \
DISPLAY_ID_COLUMN, EXCLUDE_COLUMNS
def create_metadata(df, prebatch_size, output_path):
fixed_shape = [prebatch_size, 1]
spec = {}
for column in df:
if column in CATEGORICAL_COLUMNS + [DISPLAY_ID_COLUMN]:
spec[transform_nvt_to_spark(column)] = tf.io.FixedLenFeature(shape=fixed_shape, dtype=tf.int64,
default_value=None)
else:
spec[transform_nvt_to_spark(column)] = tf.io.FixedLenFeature(shape=fixed_shape, dtype=tf.float32,
default_value=None)
metadata = dataset_metadata.DatasetMetadata(dataset_schema.from_feature_spec(spec))
metadata_io.write_metadata(metadata, output_path)
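# Packs `offset` consecutive DataFrame rows starting at `start_index` into a single
# prebatched tf.train.Example: int64 lists for the categorical columns and the
# display id, float lists for the remaining numeric columns. Column names are
# translated from the NVTabular naming convention to the Spark one so that both
# preprocessing pipelines emit TFRecords with identical feature names.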
def create_tf_example(df, start_index, offset):
parsed_features = {}
records = df.loc[start_index:start_index + offset - 1]
for column in records:
if column in CATEGORICAL_COLUMNS + [DISPLAY_ID_COLUMN]:
feature = tf.train.Feature(int64_list=tf.train.Int64List(value=records[column].to_numpy()))
else:
feature = tf.train.Feature(float_list=tf.train.FloatList(value=records[column].to_numpy()))
parsed_features[transform_nvt_to_spark(column)] = feature
features = tf.train.Features(feature=parsed_features)
return tf.train.Example(features=features)
def create_tf_records(df, prebatch_size, output_path):
with tf.io.TFRecordWriter(output_path) as out_file:
start_index = df.index[0]
for index in range(start_index, df.shape[0] + start_index - prebatch_size + 1, prebatch_size):
example = create_tf_example(df, index, prebatch_size)
out_file.write(example.SerializeToString())
def convert(path_to_nvt_dataset, output_path, prebatch_size, exclude_columns, workers=6):
train_path = os.path.join(path_to_nvt_dataset, 'train')
valid_path = os.path.join(path_to_nvt_dataset, 'valid')
output_metadata_path = os.path.join(output_path, 'transformed_metadata')
output_train_path = os.path.join(output_path, 'train')
output_valid_path = os.path.join(output_path, 'eval')
for directory in [output_metadata_path, output_train_path, output_valid_path]:
os.makedirs(directory, exist_ok=True)
train_workers, valid_workers = [], []
output_train_paths, output_valid_paths = [], []
for worker in range(workers):
part_number = str(worker).rjust(5, '0')
record_train_path = os.path.join(output_train_path, f'part-r-{part_number}')
record_valid_path = os.path.join(output_valid_path, f'part-r-{part_number}')
output_train_paths.append(record_train_path)
output_valid_paths.append(record_valid_path)
logging.warning(f'Prebatch size set to {prebatch_size}')
logging.warning(f'Number of TFRecords set to {workers}')
logging.warning(f'Reading training parquets from {train_path}')
df_train = pd.read_parquet(train_path, engine='pyarrow')
logging.warning('Done')
logging.warning(f'Removing training columns {exclude_columns}')
df_train = df_train.drop(columns=exclude_columns)
logging.warning('Done')
logging.warning(f'Creating metadata in {output_metadata_path}')
metadata_worker = Process(target=create_metadata, args=(df_train, prebatch_size, output_metadata_path))
metadata_worker.start()
logging.warning(f'Creating training TFrecords to {output_train_paths}')
shape = df_train.shape[0] // workers
shape = shape + (prebatch_size - shape % prebatch_size)
for worker_index in range(workers):
df_subset = df_train.loc[worker_index * shape:(worker_index + 1) * shape - 1]
worker = Process(target=create_tf_records, args=(df_subset, prebatch_size, output_train_paths[worker_index]))
train_workers.append(worker)
for worker in train_workers:
worker.start()
logging.warning(f'Reading validation parquets from {valid_path}')
df_valid = pd.read_parquet(valid_path, engine='pyarrow')
logging.warning('Done')
logging.warning(f'Removing validation columns {exclude_columns}')
df_valid = df_valid.drop(columns=exclude_columns)
logging.warning('Done')
logging.warning(f'Creating validation TFrecords to {output_valid_paths}')
shape = df_valid.shape[0] // workers
shape = shape + (prebatch_size - shape % prebatch_size)
for worker_index in range(workers):
df_subset = df_valid.loc[worker_index * shape:(worker_index + 1) * shape - 1]
worker = Process(target=create_tf_records, args=(df_subset, prebatch_size, output_valid_paths[worker_index]))
valid_workers.append(worker)
for worker in valid_workers:
worker.start()
for worker_index in range(workers):
metadata_worker.join()
train_workers[worker_index].join()
valid_workers[worker_index].join()
logging.warning('Done')
del df_train
del df_valid
return output_path
def nvt_to_tfrecords(config):
path_to_nvt_dataset = config['output_bucket_folder']
output_path = config['tfrecords_path']
workers = config['workers']
convert(
path_to_nvt_dataset=path_to_nvt_dataset,
output_path=output_path,
prebatch_size=PREBATCH_SIZE,
exclude_columns=EXCLUDE_COLUMNS,
workers=workers
)

View file

@@ -0,0 +1,108 @@
# Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
DISPLAY_ID_COLUMN = 'display_id'
BASE_CONT_COLUMNS = ['publish_time', 'publish_time_promo', 'timestamp', 'document_id_promo_clicked_sum_ctr',
'publisher_id_promo_clicked_sum_ctr',
'source_id_promo_clicked_sum_ctr', 'document_id_promo_count', 'publish_time_days_since_published',
'ad_id_clicked_sum_ctr',
'advertiser_id_clicked_sum_ctr', 'campaign_id_clicked_sum_ctr', 'ad_id_count',
'publish_time_promo_days_since_published']
SIM_COLUMNS = [
'doc_event_doc_ad_sim_categories',
'doc_event_doc_ad_sim_topics',
'doc_event_doc_ad_sim_entities'
]
CONTINUOUS_COLUMNS = BASE_CONT_COLUMNS + SIM_COLUMNS + [DISPLAY_ID_COLUMN]
groupby_columns = ['ad_id_count', 'ad_id_clicked_sum', 'source_id_promo_count', 'source_id_promo_clicked_sum',
'document_id_promo_count', 'document_id_promo_clicked_sum',
'publisher_id_promo_count', 'publisher_id_promo_clicked_sum', 'advertiser_id_count',
'advertiser_id_clicked_sum',
'campaign_id_count', 'campaign_id_clicked_sum']
ctr_columns = ['advertiser_id_clicked_sum_ctr', 'document_id_promo_clicked_sum_ctr',
'publisher_id_promo_clicked_sum_ctr',
'source_id_promo_clicked_sum_ctr',
'ad_id_clicked_sum_ctr', 'campaign_id_clicked_sum_ctr']
exclude_conts = ['publish_time', 'publish_time_promo', 'timestamp']
NUMERIC_COLUMNS = [col for col in CONTINUOUS_COLUMNS if col not in exclude_conts]
CATEGORICAL_COLUMNS = ['ad_id', 'document_id', 'platform', 'document_id_promo', 'campaign_id', 'advertiser_id',
'source_id',
'publisher_id', 'source_id_promo', 'publisher_id_promo', 'geo_location', 'geo_location_country',
'geo_location_state']
EXCLUDE_COLUMNS = [
'publish_time',
'publish_time_promo',
'timestamp',
'ad_id_clicked_sum',
'source_id_promo_count',
'source_id_promo_clicked_sum',
'document_id_promo_clicked_sum',
'publisher_id_promo_count', 'publisher_id_promo_clicked_sum',
'advertiser_id_count',
'advertiser_id_clicked_sum',
'campaign_id_count',
'campaign_id_clicked_sum',
'uuid',
'day_event'
]
nvt_to_spark = {
'ad_id': 'ad_id',
'clicked': 'label',
'display_id': 'display_id',
'document_id': 'doc_event_id',
'platform': 'event_platform',
'document_id_promo': 'doc_id',
'campaign_id': 'campaign_id',
'advertiser_id': 'ad_advertiser',
'source_id': 'doc_event_source_id',
'publisher_id': 'doc_event_publisher_id',
'source_id_promo': 'doc_ad_source_id',
'publisher_id_promo': 'doc_ad_publisher_id',
'geo_location': 'event_geo_location',
'geo_location_country': 'event_country',
'geo_location_state': 'event_country_state',
'document_id_promo_clicked_sum_ctr': 'pop_document_id',
'publisher_id_promo_clicked_sum_ctr': 'pop_publisher_id',
'source_id_promo_clicked_sum_ctr': 'pop_source_id',
'document_id_promo_count': 'doc_views_log_01scaled',
'publish_time_days_since_published': 'doc_event_days_since_published_log_01scaled',
'ad_id_clicked_sum_ctr': 'pop_ad_id',
'advertiser_id_clicked_sum_ctr': 'pop_advertiser_id',
'campaign_id_clicked_sum_ctr': 'pop_campain_id',
'ad_id_count': 'ad_views_log_01scaled',
'publish_time_promo_days_since_published': 'doc_ad_days_since_published_log_01scaled',
'doc_event_doc_ad_sim_categories': 'doc_event_doc_ad_sim_categories',
'doc_event_doc_ad_sim_topics': 'doc_event_doc_ad_sim_topics',
'doc_event_doc_ad_sim_entities': 'doc_event_doc_ad_sim_entities'
}
spark_to_nvt = {item: key for key, item in nvt_to_spark.items()}
def transform_nvt_to_spark(column):
return nvt_to_spark[column]
def transform_spark_to_nvt(column):
return spark_to_nvt[column]

View file

@@ -0,0 +1,48 @@
# Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
from data.outbrain.features import HASH_BUCKET_SIZES
from data.outbrain.nvtabular.utils.feature_description import transform_spark_to_nvt
def create_config(args):
stats_file = os.path.join(args.metadata_path, 'stats_wnd_workflow')
data_bucket_folder = args.data_path
output_bucket_folder = args.metadata_path
output_train_folder = os.path.join(output_bucket_folder, 'train/')
temporary_folder = os.path.join('/tmp', 'preprocessed')
train_path = os.path.join(temporary_folder, 'train_gdf.parquet')
valid_path = os.path.join(temporary_folder, 'valid_gdf.parquet')
output_valid_folder = os.path.join(output_bucket_folder, 'valid/')
tfrecords_path = args.tfrecords_path
workers = args.workers
hash_spec = {transform_spark_to_nvt(column): hash for column, hash in HASH_BUCKET_SIZES.items()}
config = {
'stats_file': stats_file,
'data_bucket_folder': data_bucket_folder,
'output_bucket_folder': output_bucket_folder,
'output_train_folder': output_train_folder,
'temporary_folder': temporary_folder,
'train_path': train_path,
'valid_path': valid_path,
'output_valid_folder': output_valid_folder,
'tfrecords_path': tfrecords_path,
'workers': workers,
'hash_spec': hash_spec
}
return config

View file

@@ -0,0 +1,254 @@
# Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import shutil
import cudf
import cupy
import nvtabular as nvt
import rmm
from dask.distributed import Client
from dask_cuda import LocalCUDACluster
from data.outbrain.nvtabular.utils.feature_description import CATEGORICAL_COLUMNS, CONTINUOUS_COLUMNS, \
DISPLAY_ID_COLUMN, groupby_columns, ctr_columns
from nvtabular.io import Shuffle
from nvtabular.ops import Normalize, FillMedian, FillMissing, LogOp, LambdaOp, JoinGroupby, HashBucket
from nvtabular.ops.column_similarity import ColumnSimilarity
from nvtabular.utils import device_mem_size, get_rmm_size
TIMESTAMP_DELTA = 1465876799998
def get_devices():
try:
devices = [int(device) for device in os.environ["CUDA_VISIBLE_DEVICES"].split(",")]
except KeyError:
from pynvml import nvmlInit, nvmlDeviceGetCount
nvmlInit()
devices = list(range(nvmlDeviceGetCount()))
return devices
def _calculate_delta(col, gdf):
col.loc[col == ''] = None
col = col.astype('datetime64[ns]')
timestamp = (gdf['timestamp'] + TIMESTAMP_DELTA).astype('datetime64[ms]')
delta = (timestamp - col).dt.days
delta = delta * (delta >= 0) * (delta <= 10 * 365)
return delta
def _df_to_coo(df, row='document_id', col=None, data='confidence_level'):
return cupy.sparse.coo_matrix((df[data].values, (df[row].values, df[col].values)))
def setup_rmm_pool(client, pool_size):
pool_size = get_rmm_size(pool_size)
client.run(rmm.reinitialize, pool_allocator=True, initial_pool_size=pool_size)
return None
def create_client(devices, local_directory):
client = None
if len(devices) > 1:
device_size = device_mem_size(kind="total")
device_limit = int(0.8 * device_size)
device_pool_size = int(0.8 * device_size)
cluster = LocalCUDACluster(
n_workers=len(devices),
CUDA_VISIBLE_DEVICES=",".join(str(x) for x in devices),
device_memory_limit=device_limit,
local_directory=local_directory
)
client = Client(cluster)
setup_rmm_pool(client, device_pool_size)
return client
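# Defines the NVTabular feature-engineering workflow: slice country and state codes
# out of the geo location, compute days-since-published deltas, derive per-id CTR
# features from group-by click sums and counts (zeroed below a per-column click
# threshold), fill missing values, log-transform and normalize the numeric features,
# compute TF-IDF similarities between the event document and the promoted document
# over categories, topics and entities, and finally hash-bucket the categoricals.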
def create_workflow(data_bucket_folder, output_bucket_folder, hash_spec, devices, local_directory):
rmm.reinitialize(managed_memory=False)
documents_categories_path = os.path.join(data_bucket_folder, 'documents_categories.csv')
documents_topics_path = os.path.join(data_bucket_folder, 'documents_topics.csv')
documents_entities_path = os.path.join(data_bucket_folder, 'documents_entities.csv')
documents_categories_cudf = cudf.read_csv(documents_categories_path)
documents_topics_cudf = cudf.read_csv(documents_topics_path)
documents_entities_cudf = cudf.read_csv(documents_entities_path)
documents_entities_cudf['entity_id'] = documents_entities_cudf['entity_id'].astype('category').cat.codes
categories = _df_to_coo(documents_categories_cudf, col='category_id')
topics = _df_to_coo(documents_topics_cudf, col='topic_id')
entities = _df_to_coo(documents_entities_cudf, col='entity_id')
del documents_categories_cudf, documents_topics_cudf, documents_entities_cudf
ctr_thresh = {
'ad_id': 5,
'source_id_promo': 10,
'publisher_id_promo': 10,
'advertiser_id': 10,
'campaign_id': 10,
'document_id_promo': 5,
}
client = create_client(
devices=devices,
local_directory=local_directory
)
workflow = nvt.Workflow(
cat_names=CATEGORICAL_COLUMNS,
cont_names=CONTINUOUS_COLUMNS,
label_name=['clicked'],
client=client
)
workflow.add_feature([
LambdaOp(
op_name='country',
f=lambda col, gdf: col.str.slice(0, 2),
columns=['geo_location'], replace=False),
LambdaOp(
op_name='state',
f=lambda col, gdf: col.str.slice(0, 5),
columns=['geo_location'], replace=False),
LambdaOp(
op_name='days_since_published',
f=_calculate_delta,
columns=['publish_time', 'publish_time_promo'], replace=False),
FillMedian(columns=['publish_time_days_since_published', 'publish_time_promo_days_since_published']),
JoinGroupby(columns=['ad_id', 'source_id_promo', 'document_id_promo', 'publisher_id_promo', 'advertiser_id',
'campaign_id'],
cont_names=['clicked'], out_path=output_bucket_folder, stats=['sum', 'count']),
LambdaOp(
op_name='ctr',
f=lambda col, gdf: ((col) / (gdf[col.name.replace('_clicked_sum', '_count')])).where(
gdf[col.name.replace('_clicked_sum', '_count')] >= ctr_thresh[col.name.replace('_clicked_sum', '')], 0),
columns=['ad_id_clicked_sum', 'source_id_promo_clicked_sum', 'document_id_promo_clicked_sum',
'publisher_id_promo_clicked_sum',
'advertiser_id_clicked_sum', 'campaign_id_clicked_sum'], replace=False),
FillMissing(columns=groupby_columns + ctr_columns),
LogOp(
columns=groupby_columns + ['publish_time_days_since_published', 'publish_time_promo_days_since_published']),
Normalize(columns=groupby_columns),
ColumnSimilarity('doc_event_doc_ad_sim_categories', 'document_id', categories, 'document_id_promo',
metric='tfidf', on_device=False),
ColumnSimilarity('doc_event_doc_ad_sim_topics', 'document_id', topics, 'document_id_promo', metric='tfidf',
on_device=False),
ColumnSimilarity('doc_event_doc_ad_sim_entities', 'document_id', entities, 'document_id_promo', metric='tfidf',
on_device=False)
])
workflow.add_cat_preprocess([
HashBucket(hash_spec)
])
workflow.finalize()
return workflow
def create_parquets(data_bucket_folder, train_path, valid_path):
cupy.random.seed(seed=0)
rmm.reinitialize(managed_memory=True)
documents_meta_path = os.path.join(data_bucket_folder, 'documents_meta.csv')
clicks_train_path = os.path.join(data_bucket_folder, 'clicks_train.csv')
events_path = os.path.join(data_bucket_folder, 'events.csv')
promoted_content_path = os.path.join(data_bucket_folder, 'promoted_content.csv')
documents_meta = cudf.read_csv(documents_meta_path, na_values=['\\N', ''])
documents_meta = documents_meta.dropna(subset='source_id')
documents_meta['publisher_id'].fillna(
documents_meta['publisher_id'].isnull().cumsum() + documents_meta['publisher_id'].max() + 1, inplace=True)
merged = (cudf.read_csv(clicks_train_path, na_values=['\\N', ''])
.merge(cudf.read_csv(events_path, na_values=['\\N', '']), on=DISPLAY_ID_COLUMN, how='left',
suffixes=('', '_event'))
.merge(cudf.read_csv(promoted_content_path, na_values=['\\N', '']), on='ad_id',
how='left',
suffixes=('', '_promo'))
.merge(documents_meta, on='document_id', how='left')
.merge(documents_meta, left_on='document_id_promo', right_on='document_id', how='left',
suffixes=('', '_promo')))
merged['day_event'] = (merged['timestamp'] / 1000 / 60 / 60 / 24).astype(int)
merged['platform'] = merged['platform'].fillna(1)
merged['platform'] = merged['platform'] - 1
display_event = merged[[DISPLAY_ID_COLUMN, 'day_event']].drop_duplicates().reset_index()
random_state = cudf.Series(cupy.random.uniform(size=len(display_event)))
valid_ids, train_ids = display_event.scatter_by_map(
((display_event.day_event <= 10) & (random_state > 0.2)).astype(int))
valid_ids = valid_ids[DISPLAY_ID_COLUMN].drop_duplicates()
train_ids = train_ids[DISPLAY_ID_COLUMN].drop_duplicates()
valid_set = merged[merged[DISPLAY_ID_COLUMN].isin(valid_ids)]
train_set = merged[merged[DISPLAY_ID_COLUMN].isin(train_ids)]
valid_set = valid_set.sort_values(DISPLAY_ID_COLUMN)
train_set.to_parquet(train_path, compression=None)
valid_set.to_parquet(valid_path, compression=None)
del merged, train_set, valid_set
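# Applies the workflow to the training split with record_stats=True so that the
# preprocessing statistics are computed on training data only, reuses them to
# transform the validation split, and saves them to the stats file.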
def save_stats(data_bucket_folder, output_bucket_folder,
output_train_folder, train_path, output_valid_folder,
valid_path, stats_file, hash_spec, local_directory):
devices = get_devices()
shuffle = Shuffle.PER_PARTITION if len(devices) > 1 else True
workflow = create_workflow(data_bucket_folder=data_bucket_folder,
output_bucket_folder=output_bucket_folder,
hash_spec=hash_spec,
devices=devices,
local_directory=local_directory)
train_dataset = nvt.Dataset(train_path, part_mem_fraction=0.12)
valid_dataset = nvt.Dataset(valid_path, part_mem_fraction=0.12)
workflow.apply(train_dataset, record_stats=True, output_path=output_train_folder, shuffle=shuffle,
out_files_per_proc=5)
workflow.apply(valid_dataset, record_stats=False, output_path=output_valid_folder, shuffle=None,
out_files_per_proc=None)
workflow.save_stats(stats_file)
return workflow
def clean(path):
shutil.rmtree(path)
def execute_pipeline(config):
required_folders = [config['temporary_folder'], config['output_train_folder'], config['output_valid_folder']]
for folder in required_folders:
os.makedirs(folder, exist_ok=True)
create_parquets(
data_bucket_folder=config['data_bucket_folder'],
train_path=config['train_path'],
valid_path=config['valid_path']
)
save_stats(
data_bucket_folder=config['data_bucket_folder'],
output_bucket_folder=config['output_bucket_folder'],
output_train_folder=config['output_train_folder'],
train_path=config['train_path'],
output_valid_folder=config['output_valid_folder'],
valid_path=config['valid_path'],
stats_file=config['stats_file'],
hash_spec=config['hash_spec'],
local_directory=config['temporary_folder']
)
clean(config['temporary_folder'])

View file

@@ -0,0 +1,13 @@
state_abb,utc_dst_time_offset_cleaned
AB,-6.0
BC,-7.0
MB,-5.0
NB,-3.0
NL,-3.0
NS,-3.0
NU,-5.0
ON,-4.0
PE,-3.0
QC,-4.0
SK,-6.0
YT,-7.0

View file

@@ -0,0 +1,247 @@
country_code,utc_dst_time_offset_cleaned
AX,3.0
AF,4.5
AL,2.0
DZ,1.0
AD,2.0
AO,1.0
AI,-4.0
AG,-4.0
AR,-3.0
AM,4.0
AW,-4.0
AU,10.0
AT,2.0
AZ,4.0
BS,-4.0
BH,3.0
BD,6.0
BB,-4.0
BY,3.0
BE,2.0
BZ,-6.0
BJ,1.0
BM,-3.0
BT,6.0
BO,-4.0
BA,2.0
BW,2.0
BR,-3.0
IO,6.0
BN,8.0
BG,3.0
BF,0.0
BI,2.0
KH,7.0
CM,1.0
CA,-5.0
BQ,-5.0
KY,-5.0
CF,1.0
TD,1.0
CL,-3.0
CN,8.0
CX,7.0
CC,6.5
CO,-5.0
KM,3.0
CD,1.0
CG,1.0
CK,-10.0
CR,-6.0
CI,0.0
HR,2.0
CW,-4.0
CY,3.0
CZ,2.0
DK,2.0
DJ,3.0
DM,-4.0
DO,-4.0
TL,9.0
EC,-5.0
EG,2.0
SV,-6.0
GQ,1.0
ER,3.0
EE,3.0
ET,3.0
FK,-3.0
FO,1.0
FJ,12.0
FI,3.0
FR,2.0
GF,-3.0
PF,-10.0
GA,1.0
GM,0.0
GE,4.0
DE,2.0
GH,0.0
GI,2.0
GR,3.0
GL,-2.0
GD,-4.0
GP,-4.0
GU,10.0
GT,-6.0
GG,1.0
GN,0.0
GW,0.0
GY,-4.0
HT,-5.0
HN,-6.0
HK,8.0
HU,2.0
IS,0.0
IN,5.5
ID,8.0
IR,4.5
IQ,3.0
IE,1.0
IM,1.0
IL,3.0
IT,2.0
JM,-5.0
JP,9.0
JE,1.0
JO,3.0
KZ,5.0
KE,3.0
KI,13.0
KP,-4.0
KR,-4.0
KP,8.5
KR,8.5
KP,9.0
KR,9.0
KW,3.0
KG,6.0
LA,7.0
LV,3.0
LB,3.0
LS,2.0
LR,0.0
LY,2.0
LI,2.0
LT,3.0
LU,2.0
MO,8.0
MK,2.0
MG,3.0
MW,2.0
MY,8.0
MV,5.0
ML,0.0
MT,2.0
MH,12.0
MQ,-4.0
MR,0.0
MU,4.0
YT,3.0
MX,-5.0
FM,10.0
MD,3.0
MC,2.0
MN,9.0
ME,2.0
MS,-4.0
MA,1.0
MZ,2.0
MM,6.5
NA,1.0
NR,12.0
NP,5.0
NL,2.0
NC,11.0
NZ,12.0
NI,-6.0
NE,1.0
NG,1.0
NU,-11.0
NF,11.0
MP,10.0
NO,2.0
OM,4.0
PK,5.0
PW,9.0
PS,3.0
PA,-5.0
PG,10.0
PY,-4.0
PE,-5.0
PH,8.0
PN,-8.0
PL,2.0
PT,1.0
PR,-4.0
QA,3.0
RE,4.0
RO,3.0
RU,7.0
RW,2.0
BL,-4.0
AS,-11.0
WS,-11.0
AS,13.0
WS,13.0
SM,2.0
ST,0.0
SA,3.0
SN,0.0
RS,2.0
SC,4.0
SL,0.0
SG,8.0
SK,2.0
SI,2.0
SB,11.0
SO,3.0
ZA,2.0
GS,-2.0
SS,3.0
ES,2.0
LK,5.5
SH,0.0
KN,-4.0
SX,-4.0
MF,-4.0
SD,3.0
SR,-3.0
SJ,2.0
SZ,2.0
SE,2.0
CH,2.0
SY,3.0
TW,8.0
TJ,5.0
TZ,3.0
TH,7.0
TG,0.0
TK,13.0
TO,13.0
TT,-4.0
TN,1.0
TR,3.0
TM,5.0
TC,-4.0
TV,12.0
UG,3.0
UA,3.0
AE,4.0
GB,1.0
US,-7.0
UY,-3.0
UZ,5.0
VU,11.0
VA,2.0
VE,-4.0
VN,7.0
VG,-4.0
VI,-4.0
VG,-4.0
VI,-4.0
WF,12.0
YE,3.0
ZM,2.0
ZW,2.0

View file

@@ -0,0 +1,52 @@
state_abb,utc_dst_time_offset_cleaned
AL,-5.0
AK,-8.0
AZ,-7.0
AR,-5.0
CA,-7.0
CO,-6.0
CT,-4.0
DE,-4.0
DC,-4.0
FL,-4.0
GA,-4.0
HI,-10.0
ID,-6.0
IL,-5.0
IN,-4.0
IA,-5.0
KS,-5.0
KY,-4.0
LA,-5.0
ME,-4.0
MD,-4.0
MA,-4.0
MI,-4.0
MN,-5.0
MS,-5.0
MO,-5.0
MT,-6.0
NE,-5.0
NV,-7.0
NH,-4.0
NJ,-4.0
NM,-6.0
NY,-4.0
NC,-4.0
ND,-5.0
OH,-4.0
OK,-5.0
OR,-7.0
PA,-4.0
RI,-4.0
SC,-4.0
SD,-5.0
TN,-5.0
TX,-5.0
UT,-6.0
VT,-4.0
VA,-4.0
WA,-7.0
WV,-4.0
WI,-5.0
WY,-6.0

View file

@@ -0,0 +1,104 @@
#!/usr/bin/env python
# coding: utf-8
# Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from pyspark.context import SparkContext, SparkConf
from pyspark.sql.functions import col
from pyspark.sql.session import SparkSession
from pyspark.sql.types import IntegerType, StringType, StructType, StructField
OUTPUT_BUCKET_FOLDER = "/tmp/spark/preprocessed/"
DATA_BUCKET_FOLDER = "/outbrain/orig/"
SPARK_TEMP_FOLDER = "/tmp/spark/spark-temp/"
conf = SparkConf().setMaster('local[*]').set('spark.executor.memory', '40g').set('spark.driver.memory', '200g').set(
"spark.local.dir", SPARK_TEMP_FOLDER)
sc = SparkContext(conf=conf)
spark = SparkSession(sc)
print('Loading data...')
events_schema = StructType(
[StructField("display_id", IntegerType(), True),
StructField("uuid_event", StringType(), True),
StructField("document_id_event", IntegerType(), True),
StructField("timestamp_event", IntegerType(), True),
StructField("platform_event", IntegerType(), True),
StructField("geo_location_event", StringType(), True)]
)
events_df = spark.read.schema(events_schema) \
.options(header='true', inferschema='false', nullValue='\\N') \
.csv(DATA_BUCKET_FOLDER + "events.csv") \
.withColumn('day_event', (col('timestamp_event') / 1000 / 60 / 60 / 24).cast("int")) \
.alias('events')
events_df.count()
print('Drop rows with empty "geo_location"...')
events_df = events_df.dropna(subset="geo_location_event")
events_df.count()
print('Drop rows with empty "platform"...')
events_df = events_df.dropna(subset="platform_event")
events_df.count()
promoted_content_schema = StructType(
[StructField("ad_id", IntegerType(), True),
StructField("document_id_promo", IntegerType(), True),
StructField("campaign_id", IntegerType(), True),
StructField("advertiser_id", IntegerType(), True)]
)
promoted_content_df = spark.read.schema(promoted_content_schema) \
.options(header='true', inferschema='false', nullValue='\\N') \
.csv(DATA_BUCKET_FOLDER + "promoted_content.csv") \
.alias('promoted_content')
clicks_train_schema = StructType(
[StructField("display_id", IntegerType(), True),
StructField("ad_id", IntegerType(), True),
StructField("clicked", IntegerType(), True)]
)
clicks_train_df = spark.read.schema(clicks_train_schema) \
.options(header='true', inferschema='false', nullValue='\\N') \
.csv(DATA_BUCKET_FOLDER + "clicks_train.csv") \
.alias('clicks_train')
clicks_train_joined_df = clicks_train_df \
.join(promoted_content_df, on='ad_id', how='left') \
.join(events_df, on='display_id', how='left')
clicks_train_joined_df.createOrReplaceTempView('clicks_train_joined')
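# Validation split: sample 20% of display_ids from days 0-10 and keep all
# display_ids from the last two days (11-12); the selected ids are written out
# below as the validation set parquet.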
validation_display_ids_df = clicks_train_joined_df.select('display_id', 'day_event') \
.distinct() \
.sampleBy("day_event", fractions={0: 0.2, 1: 0.2, 2: 0.2, 3: 0.2, 4: 0.2,
5: 0.2, 6: 0.2, 7: 0.2, 8: 0.2, 9: 0.2, 10: 0.2, 11: 1.0, 12: 1.0}, seed=0)
validation_display_ids_df.createOrReplaceTempView("validation_display_ids")
validation_set_df = spark.sql('''SELECT display_id, ad_id, uuid_event, day_event,
timestamp_event, document_id_promo, platform_event, geo_location_event
FROM clicks_train_joined t
WHERE EXISTS (SELECT display_id FROM validation_display_ids
WHERE display_id = t.display_id)''')
validation_set_gcs_output = "validation_set.parquet"
validation_set_df.write.parquet(OUTPUT_BUCKET_FOLDER + validation_set_gcs_output, mode='overwrite')
print(validation_set_df.take(5))
spark.stop()

File diff suppressed because it is too large

View file

@@ -0,0 +1,474 @@
#!/usr/bin/env python
# coding: utf-8
# Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import datetime
import numpy as np
import pandas as pd
import pyspark.sql.functions as F
import tensorflow as tf
from pyspark import TaskContext
from pyspark.context import SparkContext, SparkConf
from pyspark.sql.functions import col, udf
from pyspark.sql.session import SparkSession
from pyspark.sql.types import ArrayType, DoubleType
from tensorflow_transform.tf_metadata import dataset_metadata
from tensorflow_transform.tf_metadata import dataset_schema
from tensorflow_transform.tf_metadata import metadata_io
from data.outbrain.features import PREBATCH_SIZE, HASH_BUCKET_SIZES
from data.outbrain.spark.utils.feature_description import LABEL_COLUMN, DISPLAY_ID_COLUMN, CATEGORICAL_COLUMNS, \
DOC_CATEGORICAL_MULTIVALUED_COLUMNS, BOOL_COLUMNS, INT_COLUMNS, FLOAT_COLUMNS, \
FLOAT_COLUMNS_LOG_BIN_TRANSFORM, FLOAT_COLUMNS_SIMPLE_BIN_TRANSFORM, FLOAT_COLUMNS_NO_TRANSFORM
pd.set_option('display.max_columns', 1000)
evaluation = True
evaluation_verbose = False
OUTPUT_BUCKET_FOLDER = "/tmp/spark/preprocessed/"
DATA_BUCKET_FOLDER = "/data/orig/"
SPARK_TEMP_FOLDER = "/tmp/spark/spark-temp/"
LOCAL_DATA_TFRECORDS_DIR = "/outbrain/tfrecords"
TEST_SET_MODE = False
TENSORFLOW_HADOOP = "data/outbrain/spark/data/tensorflow-hadoop-1.5.0.jar"
conf = SparkConf().setMaster('local[*]').set('spark.executor.memory', '40g').set('spark.driver.memory', '200g').set(
"spark.local.dir", SPARK_TEMP_FOLDER)
conf.set("spark.jars", TENSORFLOW_HADOOP)
conf.set("spark.sql.files.maxPartitionBytes", 805306368)
sc = SparkContext(conf=conf)
spark = SparkSession(sc)
parser = argparse.ArgumentParser()
parser.add_argument(
'--num_train_partitions',
help='number of train partitions',
type=int,
default=40)
parser.add_argument(
'--num_valid_partitions',
help='number of validation partitions',
type=int,
default=40)
args = parser.parse_args()
num_train_partitions = args.num_train_partitions
num_valid_partitions = args.num_valid_partitions
batch_size = PREBATCH_SIZE
# # Feature Vector export
bool_feature_names = []
int_feature_names = ['ad_views',
'doc_views',
'doc_event_days_since_published',
'doc_ad_days_since_published',
]
float_feature_names = [
'pop_ad_id',
'pop_document_id',
'pop_publisher_id',
'pop_advertiser_id',
'pop_campain_id',
'pop_source_id',
'doc_event_doc_ad_sim_categories',
'doc_event_doc_ad_sim_topics',
'doc_event_doc_ad_sim_entities',
]
TRAFFIC_SOURCE_FV = 'traffic_source'
EVENT_HOUR_FV = 'event_hour'
EVENT_COUNTRY_FV = 'event_country'
EVENT_COUNTRY_STATE_FV = 'event_country_state'
EVENT_GEO_LOCATION_FV = 'event_geo_location'
EVENT_PLATFORM_FV = 'event_platform'
AD_ADVERTISER_FV = 'ad_advertiser'
DOC_AD_SOURCE_ID_FV = 'doc_ad_source_id'
DOC_AD_PUBLISHER_ID_FV = 'doc_ad_publisher_id'
DOC_EVENT_SOURCE_ID_FV = 'doc_event_source_id'
DOC_EVENT_PUBLISHER_ID_FV = 'doc_event_publisher_id'
DOC_AD_CATEGORY_ID_FV = 'doc_ad_category_id'
DOC_AD_TOPIC_ID_FV = 'doc_ad_topic_id'
DOC_AD_ENTITY_ID_FV = 'doc_ad_entity_id'
DOC_EVENT_CATEGORY_ID_FV = 'doc_event_category_id'
DOC_EVENT_TOPIC_ID_FV = 'doc_event_topic_id'
DOC_EVENT_ENTITY_ID_FV = 'doc_event_entity_id'
# ### Configuring feature vector
category_feature_names_integral = ['ad_advertiser',
'doc_ad_publisher_id',
'doc_ad_source_id',
'doc_event_publisher_id',
'doc_event_source_id',
'event_country',
'event_country_state',
'event_geo_location',
'event_hour',
'event_platform',
'traffic_source']
feature_vector_labels_integral = bool_feature_names \
+ int_feature_names \
+ float_feature_names \
+ category_feature_names_integral
train_feature_vector_gcs_folder_name = 'train_feature_vectors_integral_eval'
# ## Exporting integral feature vectors to CSV
train_feature_vectors_exported_df = spark.read.parquet(OUTPUT_BUCKET_FOLDER + train_feature_vector_gcs_folder_name)
train_feature_vectors_exported_df.take(3)
integral_headers = ['label', 'display_id', 'ad_id', 'doc_id', 'doc_event_id'] + feature_vector_labels_integral
CSV_ORDERED_COLUMNS = ['label', 'display_id', 'ad_id', 'doc_id', 'doc_event_id', 'ad_views', 'campaign_id','doc_views',
'doc_event_days_since_published', 'doc_ad_days_since_published',
'pop_ad_id', 'pop_document_id', 'pop_publisher_id', 'pop_advertiser_id', 'pop_campain_id',
'pop_source_id',
'doc_event_doc_ad_sim_categories', 'doc_event_doc_ad_sim_topics',
'doc_event_doc_ad_sim_entities', 'ad_advertiser', 'doc_ad_publisher_id',
'doc_ad_source_id', 'doc_event_publisher_id', 'doc_event_source_id', 'event_country',
'event_country_state', 'event_geo_location', 'event_platform',
'traffic_source']
FEAT_CSV_ORDERED_COLUMNS = ['ad_views', 'campaign_id','doc_views',
'doc_event_days_since_published', 'doc_ad_days_since_published',
'pop_ad_id', 'pop_document_id', 'pop_publisher_id', 'pop_advertiser_id', 'pop_campain_id',
'pop_source_id',
'doc_event_doc_ad_sim_categories', 'doc_event_doc_ad_sim_topics',
'doc_event_doc_ad_sim_entities', 'ad_advertiser', 'doc_ad_publisher_id',
'doc_ad_source_id', 'doc_event_publisher_id', 'doc_event_source_id', 'event_country',
'event_country_state', 'event_geo_location', 'event_platform',
'traffic_source']
def to_array(col):
def to_array_(v):
return v.toArray().tolist()
# Important: asNondeterministic requires Spark 2.3 or later
# It can be safely removed i.e.
# return udf(to_array_, ArrayType(DoubleType()))(col)
# but at the cost of decreased performance
return udf(to_array_, ArrayType(DoubleType())).asNondeterministic()(col)
CONVERT_TO_INT = ['doc_ad_category_id_1',
'doc_ad_category_id_2', 'doc_ad_category_id_3', 'doc_ad_topic_id_1', 'doc_ad_topic_id_2',
'doc_ad_topic_id_3', 'doc_ad_entity_id_1', 'doc_ad_entity_id_2', 'doc_ad_entity_id_3',
'doc_ad_entity_id_4', 'doc_ad_entity_id_5', 'doc_ad_entity_id_6',
'doc_ad_source_id', 'doc_event_category_id_1', 'doc_event_category_id_2', 'doc_event_category_id_3',
'doc_event_topic_id_1', 'doc_event_topic_id_2', 'doc_event_topic_id_3', 'doc_event_entity_id_1',
'doc_event_entity_id_2', 'doc_event_entity_id_3', 'doc_event_entity_id_4', 'doc_event_entity_id_5',
'doc_event_entity_id_6']
def format_number(element, name):
if name in BOOL_COLUMNS + CATEGORICAL_COLUMNS:
return element.cast("int")
elif name in CONVERT_TO_INT:
return element.cast("int")
else:
return element
def to_array_with_none(col):
def to_array_with_none_(v):
tmp = np.full((v.size,), fill_value=None, dtype=np.float64)
tmp[v.indices] = v.values
return tmp.tolist()
# Important: asNondeterministic requires Spark 2.3 or later
# It can be safely removed i.e.
# return udf(to_array_, ArrayType(DoubleType()))(col)
# but at the cost of decreased performance
return udf(to_array_with_none_, ArrayType(DoubleType())).asNondeterministic()(col)
@udf
def count_value(x):
from collections import Counter
tmp = Counter(x).most_common(2)
if not tmp or np.isnan(tmp[0][0]):
return 0
return float(tmp[0][0])
def replace_with_most_frequent(most_value):
return udf(lambda x: most_value if not x or np.isnan(x) else x)
train_feature_vectors_integral_csv_rdd_df = train_feature_vectors_exported_df.select('label', 'display_id', 'ad_id',
'document_id', 'document_id_event',
'feature_vector').withColumn(
"featvec", to_array("feature_vector")).select(
['label'] + ['display_id'] + ['ad_id'] + ['document_id'] + ['document_id_event'] + [
format_number(element, FEAT_CSV_ORDERED_COLUMNS[index]).alias(FEAT_CSV_ORDERED_COLUMNS[index]) for
index, element in enumerate([col("featvec")[i] for i in range(len(feature_vector_labels_integral))])]).replace(
float('nan'), 0)
test_validation_feature_vector_gcs_folder_name = 'validation_feature_vectors_integral'
# ## Exporting integral feature vectors
test_validation_feature_vectors_exported_df = spark.read.parquet(
OUTPUT_BUCKET_FOLDER + test_validation_feature_vector_gcs_folder_name)
test_validation_feature_vectors_exported_df = test_validation_feature_vectors_exported_df.repartition(40,
'display_id').orderBy(
'display_id')
test_validation_feature_vectors_exported_df.take(3)
test_validation_feature_vectors_integral_csv_rdd_df = test_validation_feature_vectors_exported_df.select(
'label', 'display_id', 'ad_id', 'document_id', 'document_id_event', 'feature_vector').withColumn("featvec",
to_array(
"feature_vector")).select(
['label'] + ['display_id'] + ['ad_id'] + ['document_id'] + ['document_id_event'] + [
format_number(element, FEAT_CSV_ORDERED_COLUMNS[index]).alias(FEAT_CSV_ORDERED_COLUMNS[index]) for
index, element in enumerate([col("featvec")[i] for i in range(len(feature_vector_labels_integral))])]).replace(
float('nan'), 0)
def make_spec(output_dir, batch_size=None):
fixed_shape = [batch_size, 1] if batch_size is not None else []
spec = {}
spec[LABEL_COLUMN] = tf.io.FixedLenFeature(shape=fixed_shape, dtype=tf.int64, default_value=None)
spec[DISPLAY_ID_COLUMN] = tf.io.FixedLenFeature(shape=fixed_shape, dtype=tf.int64, default_value=None)
for name in BOOL_COLUMNS:
spec[name] = tf.io.FixedLenFeature(shape=fixed_shape, dtype=tf.int64, default_value=None)
for name in FLOAT_COLUMNS_LOG_BIN_TRANSFORM + FLOAT_COLUMNS_SIMPLE_BIN_TRANSFORM + FLOAT_COLUMNS_NO_TRANSFORM:
spec[name] = tf.io.FixedLenFeature(shape=fixed_shape, dtype=tf.float32, default_value=None)
for name in FLOAT_COLUMNS_SIMPLE_BIN_TRANSFORM:
spec[name + '_binned'] = tf.io.FixedLenFeature(shape=fixed_shape, dtype=tf.int64, default_value=None)
for name in FLOAT_COLUMNS_LOG_BIN_TRANSFORM:
spec[name + '_log_01scaled'] = tf.io.FixedLenFeature(shape=fixed_shape, dtype=tf.float32, default_value=None)
for name in INT_COLUMNS:
spec[name + '_log_01scaled'] = tf.io.FixedLenFeature(shape=fixed_shape, dtype=tf.float32, default_value=None)
for name in BOOL_COLUMNS + CATEGORICAL_COLUMNS:
spec[name] = tf.io.FixedLenFeature(shape=fixed_shape, dtype=tf.int64, default_value=None)
for multi_category in DOC_CATEGORICAL_MULTIVALUED_COLUMNS:
shape = fixed_shape[:-1] + [len(DOC_CATEGORICAL_MULTIVALUED_COLUMNS[multi_category])]
spec[multi_category] = tf.io.FixedLenFeature(shape=shape, dtype=tf.int64)
metadata = dataset_metadata.DatasetMetadata(dataset_schema.from_feature_spec(spec))
metadata_io.write_metadata(metadata, output_dir)
# write out tfrecords meta
make_spec(LOCAL_DATA_TFRECORDS_DIR + '/transformed_metadata', batch_size=batch_size)
def log2_1p(x):
return np.log1p(x) / np.log(2.0)
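# Example: log2_1p(3) == 2.0, since log2(1 + 3) = 2.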
# calculate min and max stats for the given dataframes all in one go
def compute_min_max_logs(df):
print(str(datetime.datetime.now()) + '\tComputing min and max')
min_logs = {}
max_logs = {}
float_expr = []
for name in FLOAT_COLUMNS_LOG_BIN_TRANSFORM + INT_COLUMNS:
float_expr.append(F.min(name))
float_expr.append(F.max(name))
    floatDf = df.agg(*float_expr).collect()
for name in FLOAT_COLUMNS_LOG_BIN_TRANSFORM:
minAgg = floatDf[0]["min(" + name + ")"]
maxAgg = floatDf[0]["max(" + name + ")"]
min_logs[name + '_log_01scaled'] = log2_1p(minAgg * 1000)
max_logs[name + '_log_01scaled'] = log2_1p(maxAgg * 1000)
for name in INT_COLUMNS:
minAgg = floatDf[0]["min(" + name + ")"]
maxAgg = floatDf[0]["max(" + name + ")"]
min_logs[name + '_log_01scaled'] = log2_1p(minAgg)
max_logs[name + '_log_01scaled'] = log2_1p(maxAgg)
return min_logs, max_logs
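# Union the validation and train frames so the min/max log statistics cover both splits.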
all_df = test_validation_feature_vectors_integral_csv_rdd_df.union(train_feature_vectors_integral_csv_rdd_df)
min_logs, max_logs = compute_min_max_logs(all_df)
train_output_string = '/train'
eval_output_string = '/eval'
path = LOCAL_DATA_TFRECORDS_DIR
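# Pack one batch-sized pandas slice into a single tf.train.Example: every feature carries batch_size values,
# matching the [batch_size, 1] FixedLenFeature shapes written by make_spec; numeric columns get the same
# log / min-max scaling derived from compute_min_max_logs.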
def create_tf_example_spark(df, min_logs, max_logs):
result = {}
result[LABEL_COLUMN] = tf.train.Feature(int64_list=tf.train.Int64List(value=df[LABEL_COLUMN].to_list()))
result[DISPLAY_ID_COLUMN] = tf.train.Feature(int64_list=tf.train.Int64List(value=df[DISPLAY_ID_COLUMN].to_list()))
for name in FLOAT_COLUMNS:
value = df[name].to_list()
result[name] = tf.train.Feature(float_list=tf.train.FloatList(value=value))
for name in FLOAT_COLUMNS_SIMPLE_BIN_TRANSFORM:
value = df[name].multiply(10).astype('int64').to_list()
result[name + '_binned'] = tf.train.Feature(int64_list=tf.train.Int64List(value=value))
for name in FLOAT_COLUMNS_LOG_BIN_TRANSFORM:
value_prelim = df[name].multiply(1000).apply(np.log1p).multiply(1. / np.log(2.0))
value = value_prelim.astype('int64').to_list()
result[name + '_binned'] = tf.train.Feature(int64_list=tf.train.Int64List(value=value))
nn = name + '_log_01scaled'
value = value_prelim.add(-min_logs[nn]).multiply(1. / (max_logs[nn] - min_logs[nn])).to_list()
result[nn] = tf.train.Feature(float_list=tf.train.FloatList(value=value))
for name in INT_COLUMNS:
value_prelim = df[name].apply(np.log1p).multiply(1. / np.log(2.0))
value = value_prelim.astype('int64').to_list()
result[name + '_log_int'] = tf.train.Feature(int64_list=tf.train.Int64List(value=value))
nn = name + '_log_01scaled'
value = value_prelim.add(-min_logs[nn]).multiply(1. / (max_logs[nn] - min_logs[nn])).to_list()
result[nn] = tf.train.Feature(float_list=tf.train.FloatList(value=value))
for name in BOOL_COLUMNS + CATEGORICAL_COLUMNS:
value = df[name].fillna(0).astype('int64').to_list()
result[name] = tf.train.Feature(int64_list=tf.train.Int64List(value=value))
for multi_category in DOC_CATEGORICAL_MULTIVALUED_COLUMNS:
values = []
for category in DOC_CATEGORICAL_MULTIVALUED_COLUMNS[multi_category]:
values = values + [df[category].to_numpy()]
        # stack the per-category series into shape [batch_size, num_values] and flatten row-major;
        # FixedLenFeature reshapes the flat list back to [batch_size, num_values] when the TFRecord is parsed
value = np.stack(values, axis=1).flatten().tolist()
result[multi_category] = tf.train.Feature(int64_list=tf.train.Int64List(value=value))
tf_example = tf.train.Example(features=tf.train.Features(feature=result))
return tf_example
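# Modulo-hash categorical ids into a fixed number of buckets, e.g. hash_bucket(100)(12345) == 45.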
def hash_bucket(num_buckets):
return lambda x: x % num_buckets
def _transform_to_tfrecords(rdds):
csv = pd.DataFrame(list(rdds), columns=CSV_ORDERED_COLUMNS)
num_rows = len(csv.index)
examples = []
for start_ind in range(0, num_rows, batch_size if batch_size is not None else 1): # for each batch
        if start_ind + batch_size > num_rows:  # fewer than a full batch of rows remains
csv_slice = csv.iloc[start_ind:]
            # keep the smaller remainder as the last example
print("last Example has: ", len(csv_slice))
examples.append((create_tf_example_spark(csv_slice, min_logs, max_logs), len(csv_slice)))
return examples
else:
csv_slice = csv.iloc[start_ind:start_ind + (batch_size if batch_size is not None else 1)]
examples.append((create_tf_example_spark(csv_slice, min_logs, max_logs), batch_size))
return examples
max_partition_num = 30
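# Hash-bucket the categorical ids, then cut each partition's rows into batch_size slices; a partition's
# last slice may be shorter and is resliced separately below.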
def _transform_to_slices(rdds):
taskcontext = TaskContext.get()
partitionid = taskcontext.partitionId()
csv = pd.DataFrame(list(rdds), columns=CSV_ORDERED_COLUMNS)
for name, size in HASH_BUCKET_SIZES.items():
if name in csv.columns.values:
csv[name] = csv[name].apply(hash_bucket(size))
num_rows = len(csv.index)
print("working with partition: ", partitionid, max_partition_num, num_rows)
examples = []
for start_ind in range(0, num_rows, batch_size if batch_size is not None else 1): # for each batch
        if start_ind + batch_size > num_rows:  # fewer than a full batch of rows remains
csv_slice = csv.iloc[start_ind:]
print("last Example has: ", len(csv_slice), partitionid)
examples.append((csv_slice, len(csv_slice)))
return examples
else:
csv_slice = csv.iloc[start_ind:start_ind + (batch_size if batch_size is not None else 1)]
examples.append((csv_slice, len(csv_slice)))
return examples
def _transform_to_tfrecords_from_slices(rdds):
examples = []
for slice in rdds:
if len(slice[0]) != batch_size:
print("slice size is not correct, dropping: ", len(slice[0]))
else:
examples.append(
(bytearray((create_tf_example_spark(slice[0], min_logs, max_logs)).SerializeToString()), None))
return examples
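# Leftover short slices from all partitions are concatenated and re-batched; in TEST_SET_MODE a final short
# batch is padded by repeating its rows up to batch_size, otherwise it is dropped.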
def _transform_to_tfrecords_from_reslice(rdds):
examples = []
all_dataframes = pd.DataFrame([])
for slice in rdds:
all_dataframes = all_dataframes.append(slice[0])
num_rows = len(all_dataframes.index)
examples = []
for start_ind in range(0, num_rows, batch_size if batch_size is not None else 1): # for each batch
        if start_ind + batch_size > num_rows:  # fewer than a full batch of rows remains
csv_slice = all_dataframes.iloc[start_ind:]
if TEST_SET_MODE:
remain_len = batch_size - len(csv_slice)
(m, n) = divmod(remain_len, len(csv_slice))
print("remainder: ", len(csv_slice), remain_len, m, n)
                # pad by repeating the original rows: m full copies plus the first n rows, reaching batch_size
                if m:
                    original_slice = csv_slice
                    for i in range(m):
                        csv_slice = csv_slice.append(original_slice)
                csv_slice = csv_slice.append(csv_slice.iloc[:n])
print("after fill remainder: ", len(csv_slice))
examples.append(
(bytearray((create_tf_example_spark(csv_slice, min_logs, max_logs)).SerializeToString()), None))
return examples
# drop the remainder
print("dropping remainder: ", len(csv_slice))
return examples
else:
csv_slice = all_dataframes.iloc[start_ind:start_ind + (batch_size if batch_size is not None else 1)]
examples.append(
(bytearray((create_tf_example_spark(csv_slice, min_logs, max_logs)).SerializeToString()), None))
return examples
TEST_SET_MODE = False
train_features = train_feature_vectors_integral_csv_rdd_df.coalesce(30).rdd.mapPartitions(_transform_to_slices)
cached_train_features = train_features.cache()
train_full = cached_train_features.filter(lambda x: x[1] == batch_size)
# split out slices that don't form a full batch so they can be resliced, dropping as few rows as possible
train_not_full = cached_train_features.filter(lambda x: x[1] < batch_size)
train_examples_full = train_full.mapPartitions(_transform_to_tfrecords_from_slices)
train_left = train_not_full.coalesce(1).mapPartitions(_transform_to_tfrecords_from_reslice)
all_train = train_examples_full.union(train_left)
TEST_SET_MODE = True
valid_features = test_validation_feature_vectors_integral_csv_rdd_df \
    .repartition(num_valid_partitions, 'display_id').rdd.mapPartitions(_transform_to_slices)
cached_valid_features = valid_features.cache()
valid_full = cached_valid_features.filter(lambda x: x[1] == batch_size)
valid_not_full = cached_valid_features.filter(lambda x: x[1] < batch_size)
valid_examples_full = valid_full.mapPartitions(_transform_to_tfrecords_from_slices)
valid_left = valid_not_full.coalesce(1).mapPartitions(_transform_to_tfrecords_from_reslice)
all_valid = valid_examples_full.union(valid_left)
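# Write the serialized tf.train.Example records as TFRecord files; TFRecordFileOutputFormat is provided by
# the TensorFlow Hadoop/Spark connector (tensorflow-hadoop jar), which must be on the Spark classpath.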
all_train.saveAsNewAPIHadoopFile(LOCAL_DATA_TFRECORDS_DIR + train_output_string,
"org.tensorflow.hadoop.io.TFRecordFileOutputFormat",
keyClass="org.apache.hadoop.io.BytesWritable",
valueClass="org.apache.hadoop.io.NullWritable")
all_valid.saveAsNewAPIHadoopFile(LOCAL_DATA_TFRECORDS_DIR + eval_output_string,
"org.tensorflow.hadoop.io.TFRecordFileOutputFormat",
keyClass="org.apache.hadoop.io.BytesWritable",
valueClass="org.apache.hadoop.io.NullWritable")
spark.stop()

View file

@@ -0,0 +1,136 @@
# Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
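# Feature-column groups referenced by the preprocessing and training code.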
LABEL_COLUMN = "label"
DISPLAY_ID_COLUMN = 'display_id'
IS_LEAK_COLUMN = 'is_leak'
DISPLAY_ID_AND_IS_LEAK_ENCODED_COLUMN = 'display_ad_and_is_leak'
CATEGORICAL_COLUMNS = [
'ad_id',
'campaign_id',
'doc_id',
'doc_event_id',
'ad_advertiser',
'doc_ad_source_id',
'doc_ad_publisher_id',
'doc_event_publisher_id',
'doc_event_source_id',
'event_country',
'event_country_state',
'event_geo_location',
'event_platform']
DOC_CATEGORICAL_MULTIVALUED_COLUMNS = {
}
BOOL_COLUMNS = []
INT_COLUMNS = [
'ad_views',
'doc_views',
'doc_event_days_since_published',
'doc_ad_days_since_published']
FLOAT_COLUMNS_LOG_BIN_TRANSFORM = []
FLOAT_COLUMNS_NO_TRANSFORM = [
'pop_ad_id',
'pop_document_id',
'pop_publisher_id',
'pop_advertiser_id',
'pop_campain_id',
'pop_source_id',
'doc_event_doc_ad_sim_categories',
'doc_event_doc_ad_sim_topics',
'doc_event_doc_ad_sim_entities',
]
FLOAT_COLUMNS_SIMPLE_BIN_TRANSFORM = []
FLOAT_COLUMNS = FLOAT_COLUMNS_LOG_BIN_TRANSFORM + FLOAT_COLUMNS_SIMPLE_BIN_TRANSFORM + FLOAT_COLUMNS_NO_TRANSFORM
REQUEST_SINGLE_HOT_COLUMNS = [
"doc_event_id",
"doc_id",
"doc_event_source_id",
"event_geo_location",
"event_country_state",
"doc_event_publisher_id",
"event_country",
"event_hour",
"event_platform",
"traffic_source",
"event_weekend",
"user_has_already_viewed_doc"]
REQUEST_MULTI_HOT_COLUMNS = [
"doc_event_entity_id",
"doc_event_topic_id",
"doc_event_category_id"]
REQUEST_NUMERIC_COLUMNS = [
"pop_document_id_conf",
"pop_publisher_id_conf",
"pop_source_id_conf",
"pop_entity_id_conf",
"pop_topic_id_conf",
"pop_category_id_conf",
"pop_document_id",
"pop_publisher_id",
"pop_source_id",
"pop_entity_id",
"pop_topic_id",
"pop_category_id",
"user_views",
"doc_views",
"doc_event_days_since_published",
"doc_event_hour"]
ITEM_SINGLE_HOT_COLUMNS = [
"ad_id",
'campaign_id',
"doc_ad_source_id",
"ad_advertiser",
"doc_ad_publisher_id"]
ITEM_MULTI_HOT_COLUMNS = [
"doc_ad_topic_id",
"doc_ad_entity_id",
"doc_ad_category_id"]
ITEM_NUMERIC_COLUMNS = [
"pop_ad_id_conf",
"user_doc_ad_sim_categories_conf",
"user_doc_ad_sim_topics_conf",
"pop_advertiser_id_conf",
"pop_ad_id",
"pop_advertiser_id",
"pop_campain_id",
"user_doc_ad_sim_categories",
"user_doc_ad_sim_topics",
"user_doc_ad_sim_entities",
"doc_event_doc_ad_sim_categories",
"doc_event_doc_ad_sim_topics",
"doc_event_doc_ad_sim_entities",
"ad_views",
"doc_ad_days_since_published"]
NV_TRAINING_COLUMNS = (
REQUEST_SINGLE_HOT_COLUMNS +
REQUEST_MULTI_HOT_COLUMNS +
REQUEST_NUMERIC_COLUMNS +
ITEM_SINGLE_HOT_COLUMNS +
ITEM_MULTI_HOT_COLUMNS +
ITEM_NUMERIC_COLUMNS)

View file

@@ -0,0 +1,33 @@
# Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from trainer.model.widedeep import wide_deep_model
from trainer.run import train, evaluate
from trainer.utils.arguments import parse_args
from trainer.utils.setup import create_config
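# Entry point: parse the CLI arguments, build the run config and the Wide & Deep model,
# then train or evaluate depending on args.evaluate.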
def main():
args = parse_args()
config = create_config(args)
model = wide_deep_model(args)
if args.evaluate:
evaluate(args, model, config)
else:
train(args, model, config)
if __name__ == '__main__':
main()

Some files were not shown because too many files have changed in this diff.