Merge pull request #857 from NVIDIA/gh/release

Gh/release
This commit is contained in:
nv-kkudrynski 2021-03-04 16:17:59 +01:00 committed by GitHub
commit 7fe99babad
No known key found for this signature in database
GPG key ID: 4AEE18F83AFDEB23
113 changed files with 23742 additions and 712 deletions

View file

@ -1,4 +1,4 @@
ARG FROM_IMAGE_NAME=nvcr.io/nvidia/pytorch:20.07-py3
ARG FROM_IMAGE_NAME=nvcr.io/nvidia/pytorch:20.12-py3
FROM ${FROM_IMAGE_NAME}
ADD requirements.txt /workspace/

View file

@ -1,4 +1,4 @@
# Convolutional Networks for Image Classification in PyTorch
# Convolutional Network for Image Classification in PyTorch
In this repository you will find implementations of various image classification models.
@ -9,7 +9,7 @@ Detailed information on each model can be found here:
* [Models](#models)
* [Validation accuracy results](#validation-accuracy-results)
* [Training performance results](#training-performance-results)
* [Training performance: NVIDIA DGX A100 (8x A100 40GB)](#training-performance-nvidia-dgx-a100-8x-a100-40gb)
* [Training performance: NVIDIA DGX A100 (8x A100 80GB)](#training-performance-nvidia-dgx-a100-8x-a100-80gb)
* [Training performance: NVIDIA DGX-1 16GB (8x V100 16GB)](#training-performance-nvidia-dgx-1-16gb-8x-v100-16gb)
* [Training performance: NVIDIA DGX-2 (16x V100 32GB)](#training-performance-nvidia-dgx-2-16x-v100-32gb)
* [Model comparison](#model-comparison)
@ -38,22 +38,20 @@ in the corresponding model's README.
The following table shows the validation accuracy results of the
three classification models side-by-side.
| **arch** | **AMP Top1** | **AMP Top5** | **FP32 Top1** | **FP32 Top5** |
|:-:|:-:|:-:|:-:|:-:|
| resnet50 | 78.46 | 94.15 | 78.50 | 94.11 |
| resnext101-32x4d | 80.08 | 94.89 | 80.14 | 95.02 |
| se-resnext101-32x4d | 81.01 | 95.52 | 81.12 | 95.54 |
| **Model** | **Mixed Precision Top1** | **Mixed Precision Top5** | **32 bit Top1** | **32 bit Top5** |
|:-------------------:|:------------------------:|:------------------------:|:---------------:|:---------------:|
| resnet50 | 78.60 | 94.19 | 78.69 | 94.16 |
| resnext101-32x4d | 80.43 | 95.06 | 80.40 | 95.04 |
| se-resnext101-32x4d | 81.00 | 95.48 | 81.09 | 95.45 |
## Training performance results
### Training performance: NVIDIA DGX A100 (8x A100 40GB)
### Training performance: NVIDIA DGX A100 (8x A100 80GB)
Our results were obtained by running the applicable
training scripts in the pytorch-20.06 NGC container
on NVIDIA DGX A100 with (8x A100 40GB) GPUs.
training scripts in the pytorch-20.12 NGC container
on NVIDIA DGX A100 with (8x A100 80GB) GPUs.
Performance numbers (in images per second)
were averaged over an entire training epoch.
The specific training script that was run is documented
@ -63,21 +61,16 @@ The following table shows the training accuracy results of the
three classification models side-by-side.
| **arch** | **Mixed Precision** | **TF32** | **Mixed Precision Speedup** |
|:-------------------:|:-------------------:|:-------------:|:---------------------------:|
| resnet50 | 9488.39 img/s | 5322.10 img/s | 1.78x |
| resnext101-32x4d | 6758.98 img/s | 2353.25 img/s | 2.87x |
| se-resnext101-32x4d | 4670.72 img/s | 2011.21 img/s | 2.32x |
ResNeXt and SE-ResNeXt use [NHWC data layout](https://pytorch.org/tutorials/intermediate/memory_format_tutorial.html) when training using Mixed Precision,
which improves the model performance. We are currently working on adding it for ResNet.
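As an aside, opting a model into the NHWC (`channels_last`) memory format in PyTorch looks roughly like the sketch below; it assumes a CUDA device, and torchvision's `resnet50` merely stands in for the models in this repository.

```python
import torch
import torchvision.models as models

# Convert both the model and the input batch to channels_last (NHWC) layout;
# with AMP enabled, convolutions can then use the faster NHWC Tensor Core kernels.
model = models.resnet50().cuda().to(memory_format=torch.channels_last)
data = torch.rand(64, 3, 224, 224, device="cuda").to(memory_format=torch.channels_last)

with torch.cuda.amp.autocast():
    output = model(data)
    loss = output.float().mean()
loss.backward()
```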
| **Model** | **Mixed Precision** | **TF32** | **Mixed Precision Speedup** |
|:-------------------:|:-------------------:|:----------:|:---------------------------:|
| resnet50 | 15977 img/s | 7365 img/s | 2.16 x |
| resnext101-32x4d | 7399 img/s | 3193 img/s | 2.31 x |
| se-resnext101-32x4d | 5248 img/s | 2665 img/s | 1.96 x |
### Training performance: NVIDIA DGX-1 16GB (8x V100 16GB)
Our results were obtained by running the applicable
training scripts in the pytorch-20.06 NGC container
training scripts in the pytorch-20.12 NGC container
on NVIDIA DGX-1 with (8x V100 16GB) GPUs.
Performance numbers (in images per second)
were averaged over an entire training epoch.
@ -87,16 +80,11 @@ in the corresponding model's README.
The following table shows the training performance results of the
three classification models side-by-side.
| **arch** | **Mixed Precision** | **FP32** | **Mixed Precision Speedup** |
|:-------------------:|:-------------------:|:-------------:|:---------------------------:|
| resnet50 | 6565.61 img/s | 2869.19 img/s | 2.29x |
| resnext101-32x4d | 3922.74 img/s | 1136.30 img/s | 3.45x |
| se-resnext101-32x4d | 2651.13 img/s | 982.78 img/s | 2.70x |
ResNeXt and SE-ResNeXt use [NHWC data layout](https://pytorch.org/tutorials/intermediate/memory_format_tutorial.html) when training using Mixed Precision,
which improves the model performance. We are currently working on adding it for ResNet.
| **Model** | **Mixed Precision** | **FP32** | **Mixed Precision Speedup** |
|:-------------------:|:-------------------:|:----------:|:---------------------------:|
| resnet50 | 7608 img/s | 2851 img/s | 2.66 x |
| resnext101-32x4d | 3742 img/s | 1117 img/s | 3.34 x |
| se-resnext101-32x4d | 2716 img/s | 994 img/s | 2.73 x |
## Model Comparison
@ -111,8 +99,6 @@ Dot size indicates number of trainable parameters.
### Latency vs Throughput on different batch sizes
![LATvsTHR](./img/LATvsTHR.png)
Plot describes relationship between
inference latency, throughput and batch size
The plot describes the relationship between
inference latency, throughput and batch size
for the implemented models.

View file

@ -30,7 +30,7 @@ if __name__ == "__main__":
add_parser_arguments(parser)
args = parser.parse_args()
checkpoint = torch.load(args.checkpoint_path)
checkpoint = torch.load(args.checkpoint_path, map_location=torch.device('cpu'))
model_state_dict = {
k[len("module.") :] if "module." in k else k: v
@ -39,4 +39,4 @@ if __name__ == "__main__":
print(f"Loaded {checkpoint['arch']} : {checkpoint['best_prec1']}")
torch.save(model_state_dict, args.weight_path)
torch.save(model_state_dict, args.weight_path.format(arch=checkpoint['arch'][0], acc = checkpoint['best_prec1']))
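For reference, the conversion above boils down to the following minimal sketch: load a (possibly DistributedDataParallel) checkpoint onto the CPU, strip the `module.` prefix, and save only the model weights. The file names and the handling of `checkpoint['arch']` here are illustrative.

```python
import torch

# Load onto CPU so the checkpoint can be converted on a machine without a GPU.
checkpoint = torch.load("checkpoint.pth.tar", map_location=torch.device("cpu"))

# Strip the "module." prefix added by DistributedDataParallel.
model_state_dict = {
    (k[len("module."):] if k.startswith("module.") else k): v
    for k, v in checkpoint["state_dict"].items()
}

torch.save(model_state_dict, "resnet50_weights.pth")
```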

View file

@ -16,19 +16,12 @@ import argparse
import numpy as np
import json
import torch
from torch.cuda.amp import autocast
import torch.backends.cudnn as cudnn
import torchvision.transforms as transforms
import image_classification.resnet as models
from image_classification.dataloaders import load_jpeg_from_file
try:
from apex.fp16_utils import *
from apex import amp
except ImportError:
raise ImportError(
"Please install apex from https://www.github.com/nvidia/apex to run this example."
)
def add_parser_arguments(parser):
model_names = models.resnet_versions.keys()
@ -52,7 +45,7 @@ def add_parser_arguments(parser):
)
parser.add_argument("--weights", metavar="<path>", help="file with model weights")
parser.add_argument(
"--precision", metavar="PREC", default="FP16", choices=["AMP", "FP16", "FP32"]
"--precision", metavar="PREC", default="AMP", choices=["AMP", "FP32"]
)
parser.add_argument("--image", metavar="<path>", help="path to classified image")
@ -63,30 +56,28 @@ def main(args):
if args.weights is not None:
weights = torch.load(args.weights)
#Temporary fix to allow NGC checkpoint loading
# Temporary fix to allow NGC checkpoint loading
weights = {
k.replace("module.", ""): v for k, v in weights.items()
}
model.load_state_dict(weights)
model = model.cuda()
if args.precision in ["AMP", "FP16"]:
model = network_to_half()
model.eval()
with torch.no_grad():
input = load_jpeg_from_file(
args.image, cuda=True, fp16=args.precision != "FP32"
)
input = load_jpeg_from_file(
args.image, cuda=True
)
output = torch.nn.functional.softmax(model(input), dim=1).cpu().view(-1).numpy()
top5 = np.argsort(output)[-5:][::-1]
with torch.no_grad(), autocast(enabled = args.precision == "AMP"):
output = torch.nn.functional.softmax(model(input), dim=1)
print(args.image)
for c, v in zip(imgnet_classes[top5], output[top5]):
print(f"{c}: {100*v:.1f}%")
output = output.float().cpu().view(-1).numpy()
top5 = np.argsort(output)[-5:][::-1]
print(args.image)
for c, v in zip(imgnet_classes[top5], output[top5]):
print(f"{c}: {100*v:.1f}%")
if __name__ == "__main__":
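The new inference path above (native AMP `autocast` instead of APEX half-precision casts) condenses to the following self-contained sketch; torchvision's `resnet50` and a random input stand in for this repository's model factory and `load_jpeg_from_file`.

```python
import numpy as np
import torch
from torch.cuda.amp import autocast
import torchvision.models as models

model = models.resnet50().cuda().eval()
input = torch.rand(1, 3, 224, 224, device="cuda")  # placeholder for a preprocessed JPEG

with torch.no_grad(), autocast(enabled=True):
    output = torch.nn.functional.softmax(model(input), dim=1)

output = output.float().cpu().view(-1).numpy()
top5 = np.argsort(output)[-5:][::-1]
print(top5, output[top5])
```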

View file

@ -0,0 +1,183 @@
precision:
AMP:
static_loss_scale: 128
amp: True
FP32:
amp: False
TF32:
amp: False
platform:
T4:
workers: 8
DGX1V:
workers: 8
DGX2V:
workers: 8
DGXA100:
workers: 16
mode:
benchmark_training: &benchmark_training
print_freq: 1
epochs: 3
training_only: True
evaluate: False
save_checkpoints: False
benchmark_training_short:
<<: *benchmark_training
epochs: 1
data_backend: syntetic
prof: 100
benchmark_inference: &benchmark_inference
print_freq: 1
epochs: 1
training_only: False
evaluate: True
save_checkpoints: False
convergence:
print_freq: 20
training_only: False
evaluate: False
save_checkpoints: True
anchors:
# ResNet_like params: {{{
resnet_params: &resnet_params
model_config: fanin
label_smoothing: 0.1
mixup: 0.2
lr_schedule: cosine
momentum: 0.875
warmup: 8
epochs: 250
data_backend: pytorch
num_classes: 1000
image_size: 224
resnet_params_896: &resnet_params_896
<<: *resnet_params
optimizer_batch_size: 896
lr: 0.896
weight_decay: 6.103515625e-05
resnet_params_1k: &resnet_params_1k
<<: *resnet_params
optimizer_batch_size: 1024
lr: 1.024
weight_decay: 6.103515625e-05
resnet_params_2k: &resnet_params_2k
<<: *resnet_params
optimizer_batch_size: 2048
lr: 2.048
weight_decay: 3.0517578125e-05
resnet_params_4k: &resnet_params_4k
<<: *resnet_params
optimizer_batch_size: 4096
lr: 4.086
weight_decay: 3.0517578125e-05
# }}}
models:
resnet50: # {{{
DGX1V:
AMP:
<<: *resnet_params_2k
arch: resnet50
batch_size: 256
memory_format: nhwc
FP32:
<<: *resnet_params_2k
batch_size: 112
DGX2V:
AMP:
<<: *resnet_params_4k
arch: resnet50
batch_size: 256
memory_format: nhwc
FP32:
<<: *resnet_params_4k
arch: resnet50
batch_size: 256
DGXA100:
AMP:
<<: *resnet_params_2k
arch: resnet50
batch_size: 256
memory_format: nhwc
TF32:
<<: *resnet_params_2k
arch: resnet50
batch_size: 256
T4:
AMP:
<<: *resnet_params_2k
arch: resnet50
batch_size: 256
memory_format: nhwc
FP32:
<<: *resnet_params_2k
batch_size: 128
# }}}
resnext101-32x4d: # {{{
DGX1V:
AMP:
<<: *resnet_params_1k
arch: resnext101-32x4d
batch_size: 128
memory_format: nhwc
FP32:
<<: *resnet_params_1k
arch: resnext101-32x4d
batch_size: 64
DGXA100:
AMP:
<<: *resnet_params_1k
arch: resnext101-32x4d
batch_size: 128
memory_format: nhwc
TF32:
<<: *resnet_params_1k
arch: resnext101-32x4d
batch_size: 128
T4:
AMP:
<<: *resnet_params_1k
arch: resnext101-32x4d
batch_size: 128
memory_format: nhwc
FP32:
<<: *resnet_params_1k
arch: resnext101-32x4d
batch_size: 64
# }}}
se-resnext101-32x4d: # {{{
DGX1V:
AMP:
<<: *resnet_params_896
arch: se-resnext101-32x4d
batch_size: 112
memory_format: nhwc
FP32:
<<: *resnet_params_1k
arch: se-resnext101-32x4d
batch_size: 64
DGXA100:
AMP:
<<: *resnet_params_1k
arch: se-resnext101-32x4d
batch_size: 128
memory_format: nhwc
TF32:
<<: *resnet_params_1k
arch: se-resnext101-32x4d
batch_size: 128
T4:
AMP:
<<: *resnet_params_1k
arch: se-resnext101-32x4d
batch_size: 128
memory_format: nhwc
FP32:
<<: *resnet_params_1k
arch: se-resnext101-32x4d
batch_size: 64
# }}}
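A quick way to see how the YAML anchors and merge keys above expand is to load the file and print one of the merged entries; this mirrors what `launch.py` below does, and assumes the script runs next to `configs.yml`.

```python
import yaml

with open("configs.yml") as cfg_file:
    config = yaml.load(cfg_file, Loader=yaml.FullLoader)

# resnet50 / DGX1V / AMP already contains every key pulled in via <<: *resnet_params_2k.
print(config["models"]["resnet50"]["DGX1V"]["AMP"])
```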

View file

@ -50,7 +50,7 @@ except ImportError:
)
def load_jpeg_from_file(path, cuda=True, fp16=False):
def load_jpeg_from_file(path, cuda=True):
img_transforms = transforms.Compose(
[transforms.Resize(256), transforms.CenterCrop(224), transforms.ToTensor()]
)
@ -67,12 +67,7 @@ def load_jpeg_from_file(path, cuda=True, fp16=False):
mean = mean.cuda()
std = std.cuda()
img = img.cuda()
if fp16:
mean = mean.half()
std = std.half()
img = img.half()
else:
img = img.float()
img = img.float()
input = img.unsqueeze(0).sub_(mean).div_(std)
@ -98,6 +93,7 @@ class HybridTrainPipe(Pipeline):
shard_id=rank,
num_shards=world_size,
random_shuffle=True,
pad_last_batch=True,
)
if dali_cpu:
@ -125,10 +121,9 @@ class HybridTrainPipe(Pipeline):
self.cmnp = ops.CropMirrorNormalize(
device="gpu",
output_dtype=types.FLOAT,
dtype=types.FLOAT,
output_layout=types.NCHW,
crop=(crop, crop),
image_type=types.RGB,
mean=[0.485 * 255, 0.456 * 255, 0.406 * 255],
std=[0.229 * 255, 0.224 * 255, 0.225 * 255],
)
@ -160,16 +155,16 @@ class HybridValPipe(Pipeline):
shard_id=rank,
num_shards=world_size,
random_shuffle=False,
pad_last_batch=True,
)
self.decode = ops.ImageDecoder(device="mixed", output_type=types.RGB)
self.res = ops.Resize(device="gpu", resize_shorter=size)
self.cmnp = ops.CropMirrorNormalize(
device="gpu",
output_dtype=types.FLOAT,
dtype=types.FLOAT,
output_layout=types.NCHW,
crop=(crop, crop),
image_type=types.RGB,
mean=[0.485 * 255, 0.456 * 255, 0.406 * 255],
std=[0.229 * 255, 0.224 * 255, 0.225 * 255],
)
@ -213,7 +208,6 @@ def get_dali_train_loader(dali_cpu=False):
start_epoch=0,
workers=5,
_worker_init_fn=None,
fp16=False,
memory_format=torch.contiguous_format,
):
if torch.distributed.is_initialized():
@ -236,7 +230,7 @@ def get_dali_train_loader(dali_cpu=False):
pipe.build()
train_loader = DALIClassificationIterator(
pipe, size=int(pipe.epoch_size("Reader") / world_size)
pipe, reader_name="Reader", fill_last_batch=False
)
return (
@ -255,7 +249,6 @@ def get_dali_val_loader():
one_hot,
workers=5,
_worker_init_fn=None,
fp16=False,
memory_format=torch.contiguous_format,
):
if torch.distributed.is_initialized():
@ -278,7 +271,7 @@ def get_dali_val_loader():
pipe.build()
val_loader = DALIClassificationIterator(
pipe, size=int(pipe.epoch_size("Reader") / world_size)
pipe, reader_name="Reader", fill_last_batch=False
)
return (
@ -317,7 +310,7 @@ def expand(num_classes, dtype, tensor):
class PrefetchedWrapper(object):
def prefetched_loader(loader, num_classes, fp16, one_hot):
def prefetched_loader(loader, num_classes, one_hot):
mean = (
torch.tensor([0.485 * 255, 0.456 * 255, 0.406 * 255])
.cuda()
@ -328,9 +321,6 @@ class PrefetchedWrapper(object):
.cuda()
.view(1, 3, 1, 1)
)
if fp16:
mean = mean.half()
std = std.half()
stream = torch.cuda.Stream()
first = True
@ -339,14 +329,9 @@ class PrefetchedWrapper(object):
with torch.cuda.stream(stream):
next_input = next_input.cuda(non_blocking=True)
next_target = next_target.cuda(non_blocking=True)
if fp16:
next_input = next_input.half()
if one_hot:
next_target = expand(num_classes, torch.half, next_target)
else:
next_input = next_input.float()
if one_hot:
next_target = expand(num_classes, torch.float, next_target)
next_input = next_input.float()
if one_hot:
next_target = expand(num_classes, torch.float, next_target)
next_input = next_input.sub_(mean).div_(std)
@ -361,9 +346,8 @@ class PrefetchedWrapper(object):
yield input, target
def __init__(self, dataloader, start_epoch, num_classes, fp16, one_hot):
def __init__(self, dataloader, start_epoch, num_classes, one_hot):
self.dataloader = dataloader
self.fp16 = fp16
self.epoch = start_epoch
self.one_hot = one_hot
self.num_classes = num_classes
@ -376,7 +360,7 @@ class PrefetchedWrapper(object):
self.dataloader.sampler.set_epoch(self.epoch)
self.epoch += 1
return PrefetchedWrapper.prefetched_loader(
self.dataloader, self.num_classes, self.fp16, self.one_hot
self.dataloader, self.num_classes, self.one_hot
)
def __len__(self):
@ -391,7 +375,6 @@ def get_pytorch_train_loader(
start_epoch=0,
workers=5,
_worker_init_fn=None,
fp16=False,
memory_format=torch.contiguous_format,
):
traindir = os.path.join(data_path, "train")
@ -403,24 +386,24 @@ def get_pytorch_train_loader(
)
if torch.distributed.is_initialized():
train_sampler = torch.utils.data.distributed.DistributedSampler(train_dataset)
train_sampler = torch.utils.data.distributed.DistributedSampler(train_dataset, shuffle=True)
else:
train_sampler = None
train_loader = torch.utils.data.DataLoader(
train_dataset,
sampler=train_sampler,
batch_size=batch_size,
shuffle=(train_sampler is None),
num_workers=workers,
worker_init_fn=_worker_init_fn,
pin_memory=True,
sampler=train_sampler,
collate_fn=partial(fast_collate, memory_format),
drop_last=True,
)
return (
PrefetchedWrapper(train_loader, start_epoch, num_classes, fp16, one_hot),
PrefetchedWrapper(train_loader, start_epoch, num_classes, one_hot),
len(train_loader),
)
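The sampler selection above follows a standard pattern: use a `DistributedSampler` when `torch.distributed` is initialized, otherwise let the `DataLoader` shuffle. A minimal, self-contained sketch (with a dummy dataset standing in for ImageNet):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

# Dummy dataset standing in for the ImageFolder used above.
train_dataset = TensorDataset(torch.rand(128, 3, 224, 224),
                              torch.randint(0, 1000, (128,)))

if torch.distributed.is_initialized():
    train_sampler = DistributedSampler(train_dataset, shuffle=True)
else:
    train_sampler = None

train_loader = DataLoader(
    train_dataset,
    sampler=train_sampler,
    batch_size=32,
    shuffle=(train_sampler is None),  # DataLoader shuffles only when no sampler is set
    num_workers=2,
    pin_memory=True,
    drop_last=True,
)
```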
@ -432,7 +415,6 @@ def get_pytorch_val_loader(
one_hot,
workers=5,
_worker_init_fn=None,
fp16=False,
memory_format=torch.contiguous_format,
):
valdir = os.path.join(data_path, "val")
@ -441,7 +423,7 @@ def get_pytorch_val_loader(
)
if torch.distributed.is_initialized():
val_sampler = torch.utils.data.distributed.DistributedSampler(val_dataset)
val_sampler = torch.utils.data.distributed.DistributedSampler(val_dataset, shuffle=False)
else:
val_sampler = None
@ -449,20 +431,20 @@ def get_pytorch_val_loader(
val_dataset,
sampler=val_sampler,
batch_size=batch_size,
shuffle=False,
shuffle=(val_sampler is None),
num_workers=workers,
worker_init_fn=_worker_init_fn,
pin_memory=True,
collate_fn=partial(fast_collate, memory_format),
drop_last=False,
)
return PrefetchedWrapper(val_loader, 0, num_classes, fp16, one_hot), len(val_loader)
return PrefetchedWrapper(val_loader, 0, num_classes, one_hot), len(val_loader)
class SynteticDataLoader(object):
def __init__(
self,
fp16,
batch_size,
num_classes,
num_channels,
@ -483,8 +465,6 @@ class SynteticDataLoader(object):
else:
input_target = torch.randint(0, num_classes, (batch_size,))
input_target = input_target.cuda()
if fp16:
input_data = input_data.half()
self.input_data = input_data
self.input_target = input_target
@ -502,19 +482,11 @@ def get_syntetic_loader(
start_epoch=0,
workers=None,
_worker_init_fn=None,
fp16=False,
memory_format=torch.contiguous_format,
):
return (
SynteticDataLoader(
fp16,
batch_size,
num_classes,
3,
224,
224,
one_hot,
memory_format=memory_format,
batch_size, num_classes, 3, 224, 224, one_hot, memory_format=memory_format
),
-1,
)
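For context, the stream-based prefetching used by `PrefetchedWrapper` condenses to the illustrative sketch below; it assumes a CUDA device and a loader that yields uint8 NCHW image batches with integer targets.

```python
import torch

def prefetched_loader(loader):
    mean = torch.tensor([0.485 * 255, 0.456 * 255, 0.406 * 255]).cuda().view(1, 3, 1, 1)
    std = torch.tensor([0.229 * 255, 0.224 * 255, 0.225 * 255]).cuda().view(1, 3, 1, 1)

    stream = torch.cuda.Stream()
    first = True
    input, target = None, None
    for next_input, next_target in loader:
        with torch.cuda.stream(stream):  # copy + normalize the next batch on a side stream
            next_input = next_input.cuda(non_blocking=True).float()
            next_target = next_target.cuda(non_blocking=True)
            next_input = next_input.sub_(mean).div_(std)
        if not first:
            yield input, target
        else:
            first = False
        torch.cuda.current_stream().wait_stream(stream)  # sync before the batch is used
        input, target = next_input, next_target
    if input is not None:
        yield input, target
```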

View file

@ -38,14 +38,8 @@ from . import resnet as models
from . import utils
import dllogger
try:
from apex.parallel import DistributedDataParallel as DDP
from apex.fp16_utils import *
from apex import amp
except ImportError:
raise ImportError(
"Please install apex from https://www.github.com/nvidia/apex to run this example."
)
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.cuda.amp import autocast
ACC_METADATA = {"unit": "%", "format": ":.2f"}
IPS_METADATA = {"unit": "img/s", "format": ":.2f"}
@ -60,7 +54,6 @@ class ModelAndLoss(nn.Module):
loss,
pretrained_weights=None,
cuda=True,
fp16=False,
memory_format=torch.contiguous_format,
):
super(ModelAndLoss, self).__init__()
@ -74,8 +67,6 @@ class ModelAndLoss(nn.Module):
if cuda:
model = model.cuda().to(memory_format=memory_format)
if fp16:
model = network_to_half(model)
# define loss function (criterion) and optimizer
criterion = loss()
@ -92,8 +83,8 @@ class ModelAndLoss(nn.Module):
return loss, output
def distributed(self):
self.model = DDP(self.model)
def distributed(self, gpu_id):
self.model = DDP(self.model, device_ids=[gpu_id], output_device=gpu_id)
def load_model_state(self, state):
if not state is None:
@ -102,14 +93,11 @@ class ModelAndLoss(nn.Module):
def get_optimizer(
parameters,
fp16,
lr,
momentum,
weight_decay,
nesterov=False,
state=None,
static_loss_scale=1.0,
dynamic_loss_scale=False,
bn_weight_decay=False,
):
@ -138,13 +126,6 @@ def get_optimizer(
weight_decay=weight_decay,
nesterov=nesterov,
)
if fp16:
optimizer = FP16_Optimizer(
optimizer,
static_loss_scale=static_loss_scale,
dynamic_loss_scale=dynamic_loss_scale,
verbose=False,
)
if not state is None:
optimizer.load_state_dict(state)
@ -227,36 +208,25 @@ def lr_exponential_policy(
def get_train_step(
model_and_loss, optimizer, fp16, use_amp=False, batch_size_multiplier=1
model_and_loss, optimizer, scaler, use_amp=False, batch_size_multiplier=1
):
def _step(input, target, optimizer_step=True):
input_var = Variable(input)
target_var = Variable(target)
loss, output = model_and_loss(input_var, target_var)
if torch.distributed.is_initialized():
reduced_loss = utils.reduce_tensor(loss.data)
else:
reduced_loss = loss.data
if fp16:
optimizer.backward(loss)
elif use_amp:
with amp.scale_loss(loss, optimizer) as scaled_loss:
scaled_loss.backward()
else:
loss.backward()
with autocast(enabled=use_amp):
loss, output = model_and_loss(input_var, target_var)
loss /= batch_size_multiplier
if torch.distributed.is_initialized():
reduced_loss = utils.reduce_tensor(loss.data)
else:
reduced_loss = loss.data
scaler.scale(loss).backward()
if optimizer_step:
opt = (
optimizer.optimizer
if isinstance(optimizer, FP16_Optimizer)
else optimizer
)
for param_group in opt.param_groups:
for param in param_group["params"]:
param.grad /= batch_size_multiplier
optimizer.step()
scaler.step(optimizer)
scaler.update()
optimizer.zero_grad()
torch.cuda.synchronize()
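The native-AMP training step above reduces to the following sketch; `model`, `criterion`, `optimizer` and `scaler` are supplied by the caller, and the division by `batch_size_multiplier` implements gradient accumulation.

```python
from torch.cuda.amp import autocast

def train_step(model, criterion, optimizer, scaler, input, target,
               use_amp=True, batch_size_multiplier=1, optimizer_step=True):
    with autocast(enabled=use_amp):
        loss = criterion(model(input), target)
        loss = loss / batch_size_multiplier   # gradient-accumulation scaling
    scaler.scale(loss).backward()             # backward pass on the scaled loss
    if optimizer_step:
        scaler.step(optimizer)                # unscales grads, skips the step on inf/NaN
        scaler.update()
        optimizer.zero_grad()
    return loss.detach()
```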
@ -270,10 +240,11 @@ def train(
train_loader,
model_and_loss,
optimizer,
scaler,
lr_scheduler,
fp16,
logger,
epoch,
timeout_handler,
use_amp=False,
prof=-1,
batch_size_multiplier=1,
@ -315,7 +286,7 @@ def train(
step = get_train_step(
model_and_loss,
optimizer,
fp16,
scaler=scaler,
use_amp=use_amp,
batch_size_multiplier=batch_size_multiplier,
)
@ -342,31 +313,33 @@ def train(
it_time = time.time() - end
if logger is not None:
logger.log_metric("train.loss", to_python_float(loss), bs)
logger.log_metric("train.loss", loss.item(), bs)
logger.log_metric("train.compute_ips", calc_ips(bs, it_time - data_time))
logger.log_metric("train.total_ips", calc_ips(bs, it_time))
logger.log_metric("train.data_time", data_time)
logger.log_metric("train.compute_time", it_time - data_time)
end = time.time()
if timeout_handler.interrupted:
break
def get_val_step(model_and_loss):
def get_val_step(model_and_loss, use_amp=False):
def _step(input, target):
input_var = Variable(input)
target_var = Variable(target)
with torch.no_grad():
with torch.no_grad(), autocast(enabled=use_amp):
loss, output = model_and_loss(input_var, target_var)
prec1, prec5 = utils.accuracy(output.data, target, topk=(1, 5))
prec1, prec5 = utils.accuracy(output.data, target, topk=(1, 5))
if torch.distributed.is_initialized():
reduced_loss = utils.reduce_tensor(loss.data)
prec1 = utils.reduce_tensor(prec1)
prec5 = utils.reduce_tensor(prec5)
else:
reduced_loss = loss.data
if torch.distributed.is_initialized():
reduced_loss = utils.reduce_tensor(loss.data)
prec1 = utils.reduce_tensor(prec1)
prec5 = utils.reduce_tensor(prec5)
else:
reduced_loss = loss.data
torch.cuda.synchronize()
@ -376,7 +349,13 @@ def get_val_step(model_and_loss):
def validate(
val_loader, model_and_loss, fp16, logger, epoch, prof=-1, register_metrics=True
val_loader,
model_and_loss,
logger,
epoch,
use_amp=False,
prof=-1,
register_metrics=True,
):
if register_metrics and logger is not None:
logger.register_metric(
@ -440,7 +419,7 @@ def validate(
metadata=TIME_METADATA,
)
step = get_val_step(model_and_loss)
step = get_val_step(model_and_loss, use_amp=use_amp)
top1 = log.AverageMeter()
# switch to evaluate mode
@ -462,11 +441,11 @@ def validate(
it_time = time.time() - end
top1.record(to_python_float(prec1), bs)
top1.record(prec1.item(), bs)
if logger is not None:
logger.log_metric("val.top1", to_python_float(prec1), bs)
logger.log_metric("val.top5", to_python_float(prec5), bs)
logger.log_metric("val.loss", to_python_float(loss), bs)
logger.log_metric("val.top1", prec1.item(), bs)
logger.log_metric("val.top5", prec5.item(), bs)
logger.log_metric("val.loss", loss.item(), bs)
logger.log_metric("val.compute_ips", calc_ips(bs, it_time - data_time))
logger.log_metric("val.total_ips", calc_ips(bs, it_time))
logger.log_metric("val.data_time", data_time)
@ -492,10 +471,10 @@ def calc_ips(batch_size, time):
def train_loop(
model_and_loss,
optimizer,
scaler,
lr_scheduler,
train_loader,
val_loader,
fp16,
logger,
should_backup_checkpoint,
use_amp=False,
@ -510,70 +489,77 @@ def train_loop(
checkpoint_dir="./",
checkpoint_filename="checkpoint.pth.tar",
):
prec1 = -1
print(f"RUNNING EPOCHS FROM {start_epoch} TO {end_epoch}")
for epoch in range(start_epoch, end_epoch):
if logger is not None:
logger.start_epoch()
if not skip_training:
train(
train_loader,
model_and_loss,
optimizer,
lr_scheduler,
fp16,
logger,
epoch,
use_amp=use_amp,
prof=prof,
register_metrics=epoch == start_epoch,
batch_size_multiplier=batch_size_multiplier,
)
if not skip_validation:
prec1, nimg = validate(
val_loader,
model_and_loss,
fp16,
logger,
epoch,
prof=prof,
register_metrics=epoch == start_epoch,
)
if logger is not None:
logger.end_epoch()
if save_checkpoints and (
not torch.distributed.is_initialized() or torch.distributed.get_rank() == 0
):
if not skip_validation:
is_best = logger.metrics["val.top1"]["meter"].get_epoch() > best_prec1
best_prec1 = max(
logger.metrics["val.top1"]["meter"].get_epoch(), best_prec1
with utils.TimeoutHandler() as timeout_handler:
for epoch in range(start_epoch, end_epoch):
if logger is not None:
logger.start_epoch()
if not skip_training:
train(
train_loader,
model_and_loss,
optimizer,
scaler,
lr_scheduler,
logger,
epoch,
timeout_handler,
use_amp=use_amp,
prof=prof,
register_metrics=epoch == start_epoch,
batch_size_multiplier=batch_size_multiplier,
)
else:
is_best = False
best_prec1 = 0
if should_backup_checkpoint(epoch):
backup_filename = "checkpoint-{}.pth.tar".format(epoch + 1)
else:
backup_filename = None
utils.save_checkpoint(
{
"epoch": epoch + 1,
"arch": model_and_loss.arch,
"state_dict": model_and_loss.model.state_dict(),
"best_prec1": best_prec1,
"optimizer": optimizer.state_dict(),
},
is_best,
checkpoint_dir=checkpoint_dir,
backup_filename=backup_filename,
filename=checkpoint_filename,
)
if not skip_validation:
prec1, nimg = validate(
val_loader,
model_and_loss,
logger,
epoch,
use_amp=use_amp,
prof=prof,
register_metrics=epoch == start_epoch,
)
if logger is not None:
logger.end_epoch()
if save_checkpoints and (
not torch.distributed.is_initialized()
or torch.distributed.get_rank() == 0
):
if not skip_validation:
is_best = (
logger.metrics["val.top1"]["meter"].get_epoch() > best_prec1
)
best_prec1 = max(
logger.metrics["val.top1"]["meter"].get_epoch(), best_prec1
)
else:
is_best = False
best_prec1 = 0
if should_backup_checkpoint(epoch):
backup_filename = "checkpoint-{}.pth.tar".format(epoch + 1)
else:
backup_filename = None
utils.save_checkpoint(
{
"epoch": epoch + 1,
"arch": model_and_loss.arch,
"state_dict": model_and_loss.model.state_dict(),
"best_prec1": best_prec1,
"optimizer": optimizer.state_dict(),
},
is_best,
checkpoint_dir=checkpoint_dir,
backup_filename=backup_filename,
filename=checkpoint_filename,
)
if timeout_handler.interrupted:
break
# }}}

View file

@ -31,6 +31,7 @@ import os
import numpy as np
import torch
import shutil
import signal
import torch.distributed as dist
@ -106,3 +107,45 @@ def reduce_tensor(tensor):
def first_n(n, generator):
for i, d in zip(range(n), generator):
yield d
class TimeoutHandler:
def __init__(self, sig=signal.SIGTERM):
self.sig = sig
rank = dist.get_rank() if dist.is_initialized() else 0
self.device = f'cuda:{rank}'
@property
def interrupted(self):
if not dist.is_initialized():
return self._interrupted
interrupted = torch.tensor(self._interrupted).int().to(self.device)
dist.broadcast(interrupted, 0)
interrupted = bool(interrupted.item())
return interrupted
def __enter__(self):
self._interrupted = False
self.released = False
self.original_handler = signal.getsignal(self.sig)
def master_handler(signum, frame):
self.release()
self._interrupted = True
print(f'Received SIGTERM')
def ignorind_handler(signum, frame):
self.release()
print('Received SIGTERM, ignoring')
rank = dist.get_rank() if dist.is_initialized() else 0
if rank == 0:
signal.signal(self.sig, master_handler)
else:
signal.signal(self.sig, ignorind_handler)
return self
def __exit__(self, type, value, tb):
self.release()
def release(self):
if self.released:
return False
signal.signal(self.sig, self.original_handler)
self.released = True
return True
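An illustrative usage sketch for the handler above (single-process case): a SIGTERM delivered to the process flips `interrupted`, so the training loop can exit cleanly after the current step. The import path assumes this repository's `image_classification` package.

```python
import os
import signal

from image_classification.utils import TimeoutHandler  # assumed module path

with TimeoutHandler(sig=signal.SIGTERM) as timeout_handler:
    for step in range(10):
        # ... one training iteration would go here ...
        if step == 3:
            os.kill(os.getpid(), signal.SIGTERM)  # simulate a preemption signal
        if timeout_handler.interrupted:
            print(f"stopping cleanly at step {step}")
            break
```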

View file

@ -0,0 +1,50 @@
import os
from pathlib import Path
from dataclasses import dataclass
from typing import Dict, Any
import yaml
from main import main, add_parser_arguments
import torch.backends.cudnn as cudnn
import argparse
def get_config_path():
return Path(os.path.dirname(os.path.abspath(__file__))) / "configs.yml"
if __name__ == "__main__":
yaml_cfg_parser = argparse.ArgumentParser(add_help=False)
yaml_cfg_parser.add_argument(
"--cfg_file",
default=get_config_path(),
type=str,
help="path to yaml config file",
)
yaml_cfg_parser.add_argument("--model", default=None, type=str, required=True)
yaml_cfg_parser.add_argument("--mode", default=None, type=str, required=True)
yaml_cfg_parser.add_argument("--precision", default=None, type=str, required=True)
yaml_cfg_parser.add_argument("--platform", default=None, type=str, required=True)
yaml_args, rest = yaml_cfg_parser.parse_known_args()
with open(yaml_args.cfg_file, "r") as cfg_file:
config = yaml.load(cfg_file, Loader=yaml.FullLoader)
cfg = {
**config["precision"][yaml_args.precision],
**config["platform"][yaml_args.platform],
**config["models"][yaml_args.model][yaml_args.platform][yaml_args.precision],
**config["mode"][yaml_args.mode],
}
print(cfg)
parser = argparse.ArgumentParser(description="PyTorch ImageNet Training")
add_parser_arguments(parser)
parser.set_defaults(**cfg)
args = parser.parse_args(rest)
print(args)
cudnn.benchmark = True
main(args)
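For reference, a typical invocation (matching the benchmark commands in the README below) is `python ./launch.py --model resnet50 --precision AMP --mode benchmark_training --platform DGXA100 <path to imagenet> --raport-file benchmark.json --epochs 1 --prof 100`; because `parse_known_args` is used, any additional `main.py` argument given on the command line overrides the defaults taken from `configs.yml`.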

View file

@ -32,6 +32,7 @@ import os
import shutil
import time
import random
import signal
import numpy as np
import torch
@ -45,15 +46,7 @@ import torch.utils.data
import torch.utils.data.distributed
import torchvision.transforms as transforms
import torchvision.datasets as datasets
try:
from apex.parallel import DistributedDataParallel as DDP
from apex.fp16_utils import *
from apex import amp
except ImportError:
raise ImportError(
"Please install apex from https://www.github.com/nvidia/apex to run this example."
)
from torch.nn.parallel import DistributedDataParallel as DDP
import image_classification.resnet as models
import image_classification.logger as log
@ -224,12 +217,11 @@ def add_parser_arguments(parser):
help="load weights from here",
)
parser.add_argument("--fp16", action="store_true", help="Run model fp16 mode.")
parser.add_argument(
"--static-loss-scale",
type=float,
default=1,
help="Static loss scale, positive power of 2 values can improve fp16 convergence.",
help="Static loss scale, positive power of 2 values can improve amp convergence.",
)
parser.add_argument(
"--dynamic-loss-scale",
@ -312,10 +304,6 @@ def main(args):
dist.init_process_group(backend="nccl", init_method="env://")
args.world_size = torch.distributed.get_world_size()
if args.amp and args.fp16:
print("Please use only one of the --fp16/--amp flags")
exit(1)
if args.seed is not None:
print("Using seed = {}".format(args.seed))
torch.manual_seed(args.seed + args.local_rank)
@ -324,22 +312,25 @@ def main(args):
random.seed(args.seed + args.local_rank)
def _worker_init_fn(id):
def handler(signum, frame):
print(f"Worker {id} received signal {signum}")
signal.signal(signal.SIGTERM, handler)
np.random.seed(seed=args.seed + args.local_rank + id)
random.seed(args.seed + args.local_rank + id)
else:
def _worker_init_fn(id):
pass
def handler(signum, frame):
print(f"Worker {id} received signal {signum}")
if args.fp16:
assert (
torch.backends.cudnn.enabled
), "fp16 mode requires cudnn backend to be enabled."
signal.signal(signal.SIGTERM, handler)
if args.static_loss_scale != 1.0:
if not args.fp16:
print("Warning: if --fp16 is not used, static_loss_scale will be ignored.")
if not args.amp:
print("Warning: if --amp is not used, static_loss_scale will be ignored.")
if args.optimizer_batch_size < 0:
batch_size_multiplier = 1
@ -387,6 +378,11 @@ def main(args):
args.resume, checkpoint["epoch"]
)
)
if start_epoch >= args.epochs:
print(
f"Launched training for {args.epochs}, checkpoint already run {start_epoch}"
)
exit(1)
else:
print("=> no checkpoint found at '{}'".format(args.resume))
model_state = None
@ -410,7 +406,6 @@ def main(args):
loss,
pretrained_weights=pretrained_weights,
cuda=True,
fp16=args.fp16,
memory_format=memory_format,
)
@ -427,6 +422,9 @@ def main(args):
elif args.data_backend == "syntetic":
get_val_loader = get_syntetic_loader
get_train_loader = get_syntetic_loader
else:
print("Bad databackend picked")
exit(1)
train_loader, train_loader_len = get_train_loader(
args.data,
@ -435,7 +433,6 @@ def main(args):
args.mixup > 0.0,
start_epoch=start_epoch,
workers=args.workers,
fp16=args.fp16,
memory_format=memory_format,
)
if args.mixup != 0.0:
@ -447,7 +444,6 @@ def main(args):
args.num_classes,
False,
workers=args.workers,
fp16=args.fp16,
memory_format=memory_format,
)
@ -473,15 +469,12 @@ def main(args):
optimizer = get_optimizer(
list(model_and_loss.model.named_parameters()),
args.fp16,
args.lr,
args.momentum,
args.weight_decay,
nesterov=args.nesterov,
bn_weight_decay=args.bn_weight_decay,
state=optimizer_state,
static_loss_scale=args.static_loss_scale,
dynamic_loss_scale=args.dynamic_loss_scale,
)
if args.lr_schedule == "step":
@ -493,26 +486,26 @@ def main(args):
elif args.lr_schedule == "linear":
lr_policy = lr_linear_policy(args.lr, args.warmup, args.epochs, logger=logger)
if args.amp:
model_and_loss, optimizer = amp.initialize(
model_and_loss,
optimizer,
opt_level="O1",
loss_scale="dynamic" if args.dynamic_loss_scale else args.static_loss_scale,
)
scaler = torch.cuda.amp.GradScaler(
init_scale=args.static_loss_scale,
growth_factor=2,
backoff_factor=0.5,
growth_interval=100 if args.dynamic_loss_scale else 1000000000,
enabled=args.amp,
)
if args.distributed:
model_and_loss.distributed()
model_and_loss.distributed(args.gpu)
model_and_loss.load_model_state(model_state)
train_loop(
model_and_loss,
optimizer,
scaler,
lr_policy,
train_loader,
val_loader,
args.fp16,
logger,
should_backup_checkpoint(args),
use_amp=args.amp,

View file

@ -1,2 +1 @@
pytorch-ignite
git+git://github.com/NVIDIA/dllogger.git@26a0f8f1958de2c0c460925ff6102a4d2486d6cc#egg=dllogger

View file

@ -30,12 +30,12 @@ achieve state-of-the-art accuracy, and is tested and maintained by NVIDIA.
* [Inference performance benchmark](#inference-performance-benchmark)
* [Results](#results)
* [Training accuracy results](#training-accuracy-results)
* [Training accuracy: NVIDIA DGX A100 (8x A100 40GB)](#training-accuracy-nvidia-dgx-a100-8x-a100-40gb)
* [Training accuracy: NVIDIA DGX A100 (8x A100 80GB)](#training-accuracy-nvidia-dgx-a100-8x-a100-80gb)
* [Training accuracy: NVIDIA DGX-1 (8x V100 16GB)](#training-accuracy-nvidia-dgx-1-8x-v100-16gb)
* [Training accuracy: NVIDIA DGX-2 (16x V100 32GB)](#training-accuracy-nvidia-dgx-2-16x-v100-32gb)
* [Example plots](#example-plots)
* [Training performance results](#training-performance-results)
* [Training performance: NVIDIA DGX A100 (8x A100 40GB)](#training-performance-nvidia-dgx-a100-8x-a100-40gb)
* [Training performance: NVIDIA DGX A100 (8x A100 80GB)](#training-performance-nvidia-dgx-a100-8x-a100-80gb)
* [Training performance: NVIDIA DGX-1 16GB (8x V100 16GB)](#training-performance-nvidia-dgx-1-16gb-8x-v100-16gb)
* [Training performance: NVIDIA DGX-1 32GB (8x V100 32GB)](#training-performance-nvidia-dgx-1-32gb-8x-v100-32gb)
* [Inference performance results](#inference-performance-results)
@ -119,6 +119,8 @@ and this recipe keeps the original assumption that validation is done on 224px i
Using 288px images means that a lot more FLOPs are needed during inference to reach the same accuracy.
### Feature support matrix
The following features are supported by this model:
@ -204,7 +206,7 @@ The following section lists the requirements that you need to meet in order to s
This repository contains a Dockerfile that extends the PyTorch NGC container and encapsulates some dependencies. Aside from these dependencies, ensure you have the following components:
* [NVIDIA Docker](https://github.com/NVIDIA/nvidia-docker)
* [PyTorch 20.06-py3 NGC container](https://ngc.nvidia.com/registry/nvidia-pytorch) or newer
* [PyTorch 20.12-py3 NGC container](https://ngc.nvidia.com/registry/nvidia-pytorch) or newer
* Supported GPUs:
* [NVIDIA Volta architecture](https://www.nvidia.com/en-us/data-center/volta-gpu-architecture/)
* [NVIDIA Turing architecture](https://www.nvidia.com/en-us/geforce/turing/)
@ -256,28 +258,28 @@ For the specifics concerning training and inference, see the [Advanced](#advance
The directory in which the `train/` and `val/` directories are placed is referred to as `<path to imagenet>` in this document.
### 3. Build the RN50v1.5 PyTorch NGC container.
### 3. Build the ResNet50 PyTorch NGC container.
```
docker build . -t nvidia_rn50
docker build . -t nvidia_resnet50
```
### 4. Start an interactive session in the NGC container to run training/inference.
```
nvidia-docker run --rm -it -v <path to imagenet>:/data/imagenet --ipc=host nvidia_rn50
nvidia-docker run --rm -it -v <path to imagenet>:/imagenet --ipc=host nvidia_resnet50
```
### 5. Start training
To run training for a standard configuration (DGXA100/DGX1/DGX2, AMP/TF32/FP32, 50/90/250 Epochs),
To run training for a standard configuration (DGXA100/DGX1V/DGX2V, AMP/TF32/FP32, 90/250 Epochs),
run one of the scripts in the `./resnet50v1.5/training` directory
called `./resnet50v1.5/training/{AMP, TF32, FP32}/{DGXA100, DGX1, DGX2}_RN50_{AMP, TF32, FP32}_{50,90,250}E.sh`.
called `./resnet50v1.5/training/{AMP, TF32, FP32}/{ DGXA100, DGX1V, DGX2V }_resnet50_{AMP, TF32, FP32}_{ 90, 250 }E.sh`.
Ensure ImageNet is mounted in the `/data/imagenet` directory.
Ensure ImageNet is mounted in the `/imagenet` directory.
Example:
`bash ./resnet50v1.5/training/AMP/DGX1_RN50_AMP_250E.sh <path where to store checkpoints and logs>`
`bash ./resnet50v1.5/training/AMP/DGX1_resnet50_AMP_250E.sh <path where to store checkpoints and logs>`
### 6. Start inference
@ -295,7 +297,7 @@ To run inference on ImageNet, run:
To run inference on JPEG image using pretrained weights:
`python classify.py --arch resnet50 -c fanin --weights nvidia_resnet50_200821.pth.tar --precision AMP|FP32 --image <path to JPEG image>`
`python classify.py --arch resnet50 -c fanin --weights nvidia_resnet50_200821.pth.tar --precision AMP|FP32 --image <path to JPEG image>`
## Advanced
@ -334,7 +336,7 @@ usage: main.py [-h] [--data-backend BACKEND] [--arch ARCH]
[--lr-schedule SCHEDULE] [--warmup E] [--label-smoothing S]
[--mixup ALPHA] [--momentum M] [--weight-decay W]
[--bn-weight-decay] [--nesterov] [--print-freq N]
[--resume PATH] [--pretrained-weights PATH] [--fp16]
[--resume PATH] [--pretrained-weights PATH]
[--static-loss-scale STATIC_LOSS_SCALE] [--dynamic-loss-scale]
[--prof N] [--amp] [--seed SEED] [--gather-checkpoints]
[--raport-file RAPORT_FILE] [--evaluate] [--training-only]
@ -353,8 +355,10 @@ optional arguments:
data backend: pytorch | syntetic | dali-gpu | dali-cpu
(default: dali-cpu)
--arch ARCH, -a ARCH model architecture: resnet18 | resnet34 | resnet50 |
resnet101 | resnet152 | resnext101-32x4d | se-
resnext101-32x4d (default: resnet50)
resnet101 | resnet152 | resnext50-32x4d |
resnext101-32x4d | resnext101-32x8d |
resnext101-32x8d-basic | se-resnext101-32x4d (default:
resnet50)
--model-config CONF, -c CONF
model configs: classic | fanin | grp-fanin | grp-
fanout(default: classic)
@ -383,10 +387,9 @@ optional arguments:
--resume PATH path to latest checkpoint (default: none)
--pretrained-weights PATH
load weights from here
--fp16 Run model fp16 mode.
--static-loss-scale STATIC_LOSS_SCALE
Static loss scale, positive power of 2 values can
improve fp16 convergence.
improve amp convergence.
--dynamic-loss-scale Use dynamic loss scaling. If supplied, this argument
supersedes --static-loss-scale.
--prof N Run only N iterations
@ -404,6 +407,7 @@ optional arguments:
--workspace DIR path to directory where checkpoints will be stored
--memory-format {nchw,nhwc}
memory layout, nchw or nhwc
```
To use your own dataset, divide it into directories as in the following scheme:
- Training images - `train/<class id>/<image>`
- Validation images - `val/<class id>/<image>`
If your dataset has a number of classes different than 1000, you need to add a custom config
in the `image_classification/resnet.py` file.
```python
resnet_versions = {
...
'resnet50-custom' : {
'net' : ResNet,
'block' : Bottleneck,
'layers' : [3, 4, 6, 3],
'widths' : [64, 128, 256, 512],
'expansion' : 4,
'num_classes' : <custom number of classes>,
}
}
```
After adding the config, run the training script with `--arch resnet50-custom` flag.
If your dataset has a number of classes different than 1000, you need to pass the `--num-classes N` flag to the training script.
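For example, assuming a 10-class dataset arranged as above (the batch size and epoch count here are only illustrative):
`python ./main.py --arch resnet50 --num-classes 10 -b 64 --epochs 90 <path to dataset>`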
### Training process
@ -454,7 +441,7 @@ To restart training from checkpoint use `--resume` option.
To start training from pretrained weights (e.g. downloaded from NGC) use `--pretrained-weights` option.
The difference between those two is that the pretrained weights contain only model weights,
and checkpoints, apart from model weights, contain optimizer state, LR scheduler state, RNG state.
and checkpoints, apart from model weights, contain optimizer state, LR scheduler state.
Checkpoints are suitable for dividing the training into parts, for example in order
to divide the training job into shorter stages, or to restart training after an infrastructure failure.
@ -500,14 +487,13 @@ wget --content-disposition https://api.ngc.nvidia.com/v2/models/nvidia/resnet50_
unzip resnet50_pyt_amp_20.06.0.zip
```
To run inference on ImageNet, run:
`python ./main.py --arch resnet50 --evaluate --epochs 1 --pretrained-weights nvidia_resnet50_200821.pth.tar -b <batch size> <path to imagenet>`
To run inference on JPEG image using pretrained weights:
`python classify.py --arch resnet50 -c fanin --weights nvidia_resnet50_200821.pth.tar --precision AMP|FP32 --image <path to JPEG image>`
`python classify.py --arch resnet50 --weights nvidia_resnet50_200821.pth.tar --precision AMP|FP32 --image <path to JPEG image>`
## Performance
@ -521,72 +507,63 @@ The following section shows how to run benchmarks measuring the model performanc
To benchmark training, run:
* For 1 GPU
* FP32
`python ./main.py --arch resnet50 -b <batch_size> --training-only -p 1 --raport-file benchmark.json --epochs 1 --prof 100 <path to imagenet>`
* FP32 (V100 GPUs only)
`python ./launch.py --model resnet50 --precision FP32 --mode benchmark_training --platform DGX1V <path to imagenet> --raport-file benchmark.json --epochs 1 --prof 100`
* TF32 (A100 GPUs only)
`python ./launch.py --model resnet50 --precision TF32 --mode benchmark_training --platform DGXA100 <path to imagenet> --raport-file benchmark.json --epochs 1 --prof 100`
* AMP
`python ./main.py --arch resnet50 -b <batch_size> --training-only -p 1 --raport-file benchmark.json --epochs 1 --prof 100 --amp --static-loss-scale 256 <path to imagenet>`
`python ./launch.py --model resnet50 --precision AMP --mode benchmark_training --platform <DGX1V|DGXA100> <path to imagenet> --raport-file benchmark.json --epochs 1 --prof 100`
* For multiple GPUs
* FP32
`python ./multiproc.py --nproc_per_node 8 ./main.py --arch resnet50 -b <batch_size> --training-only -p 1 --raport-file benchmark.json --epochs 1 --prof 100 <path to imagenet>`
* FP32 (V100 GPUs only)
`python ./launch.py --model resnet50 --precision FP32 --mode benchmark_training --platform DGX1V <path to imagenet> --raport-file benchmark.json --epochs 1 --prof 100`
* TF32 (A100 GPUs only)
`python ./multiproc.py --nproc_per_node 8 ./launch.py --model resnet50 --precision TF32 --mode benchmark_training --platform DGXA100 <path to imagenet> --raport-file benchmark.json --epochs 1 --prof 100`
* AMP
`python ./multiproc.py --nproc_per_node 8 ./main.py --arch resnet50 -b <batch_size> --training-only -p 1 --raport-file benchmark.json --amp --static-loss-scale 256 --epochs 1 --prof 100 <path to imagenet>`
`python ./multiproc.py --nproc_per_node 8 ./launch.py --model resnet50 --precision AMP --mode benchmark_training --platform <DGX1V|DGXA100> <path to imagenet> --raport-file benchmark.json --epochs 1 --prof 100`
Each of these scripts will run 100 iterations and save results in the `benchmark.json` file.
Batch size should be picked appropriately depending on the hardware configuration.
| *Platform* | *Precision* | *Batch Size* |
|:----------:|:-----------:|:------------:|
| DGXA100 | AMP | 256 |
| DGXA100 | TF32 | 256 |
| DGX-1 | AMP | 256 |
| DGX-1 | FP32 | 128 |
#### Inference performance benchmark
To benchmark inference, run:
* FP32
* FP32 (V100 GPUs only)
`python ./main.py --arch resnet50 -p 1 --raport-file benchmark.json --epochs 1 --prof 100 --evaluate <path to imagenet>`
`python ./launch.py --model resnet50 --precision FP32 --mode benchmark_inference --platform DGX1V <path to imagenet> --raport-file benchmark.json --epochs 1 --prof 100`
* TF32 (A100 GPUs only)
`python ./launch.py --model resnet50 --precision FP32 --mode benchmark_inference --platform DGXA100 <path to imagenet> --raport-file benchmark.json --epochs 1 --prof 100`
* AMP
`python ./main.py --arch resnet50 -p 1 --raport-file benchmark.json --epochs 1 --prof 100 --evaluate --amp <path to imagenet>`
`python ./launch.py --model resnet50 --precision AMP --mode benchmark_inference --platform <DGX1V|DGXA100> <path to imagenet> --raport-file benchmark.json --epochs 1 --prof 100`
Each of these scripts will run 100 iterations and save results in the `benchmark.json` file.
Batch size should be picked appropriately depending on the hardware configuration.
| *Platform* | *Precision* | *Batch Size* |
|:----------:|:-----------:|:------------:|
| DGXA100 | AMP | 256 |
| DGXA100 | TF32 | 256 |
| DGX-1 | AMP | 256 |
| DGX-1 | FP32 | 128 |
### Results
Our results were obtained by running the applicable training script in the pytorch-20.06 NGC container.
Our results were obtained by running the applicable training script in the pytorch-20.12 NGC container.
To achieve these same results, follow the steps in the [Quick Start Guide](#quick-start-guide).
#### Training accuracy results
##### Training accuracy: NVIDIA DGX A100 (8x A100 40GB)
##### Training accuracy: NVIDIA DGX A100 (8x A100 80GB)
| **epochs** | **Mixed Precision Top1** | **TF32 Top1** |
|:------:|:--------------------:|:--------------:|
| 90 | 76.93 +/- 0.23 | 76.85 +/- 0.30 |
| **Epochs** | **Mixed Precision Top1** | **TF32 Top1** |
|:----------:|:------------------------:|:--------------:|
| 90 | 77.12 +/- 0.11 | 76.95 +/- 0.18 |
| 250 | 78.43 +/- 0.11 | 78.38 +/- 0.17 |
##### Training accuracy: NVIDIA DGX-1 (8x V100 16GB)
| **epochs** | **Mixed Precision Top1** | **FP32 Top1** |
|:-:|:-:|:-:|
| 50 | 76.25 +/- 0.04 | 76.26 +/- 0.07 |
| 90 | 77.09 +/- 0.10 | 77.01 +/- 0.16 |
| 250 | 78.42 +/- 0.04 | 78.30 +/- 0.16 |
| **Epochs** | **Mixed Precision Top1** | **FP32 Top1** |
|:----------:|:------------------------:|:--------------:|
| 90 | 76.88 +/- 0.16 | 77.01 +/- 0.16 |
| 250 | 78.25 +/- 0.12 | 78.30 +/- 0.16 |
##### Training accuracy: NVIDIA DGX-2 (16x V100 32GB)
@ -610,26 +587,28 @@ The following images show a 250 epochs configuration on a DGX-1V.
#### Training performance results
##### Training performance: NVIDIA DGX A100 (8x A100 40GB)
##### Training performance: NVIDIA DGX A100 (8x A100 80GB)
| **GPUs** | **Mixed Precision** | **TF32** | **Mixed Precision Speedup** | **Mixed Precision Strong Scaling** | **Mixed Precision Training Time (90E)** | **TF32 Strong Scaling** | **TF32 Training Time (90E)** |
|:--------:|:-------------------:|:----------:|:---------------------------:|:----------------------------------:|:---------------------------------------:|:-----------------------:|:----------------------------:|
| 1 | 2461 img/s | 945 img/s | 2.6 x | 1.0 x | ~14 hours | 1.0 x | ~36 hours |
| 8 | 15977 img/s | 7365 img/s | 2.16 x | 6.49 x | ~3 hours | 7.78 x | ~5 hours |
|**GPUs**|**Mixed Precision**| **TF32** |**Mixed Precision Speedup**|**Mixed Precision Strong Scaling**|**Mixed Precision Training Time (90E)**|**TF32 Strong Scaling**|**TF32 Training Time (90E)**|
|:------:|:-----------------:|:-----------:|:-------------------------:|:--------------------------------:|:-------------------------------------:|:---------------------:|:--------------------------:|
| 1 | 1240.81 img/s |680.15 img/s | 1.82x | 1.00x | ~27 hours | 1.00x | ~49 hours |
| 8 | 9604.92 img/s |5379.82 img/s| 1.79x | 7.74x | ~4 hours | 7.91x | ~6 hours |
##### Training performance: NVIDIA DGX-1 16GB (8x V100 16GB)
|**GPUs**|**Mixed Precision**| **FP32** |**Mixed Precision Speedup**|**Mixed Precision Strong Scaling**|**Mixed Precision Training Time (90E)**|**FP32 Strong Scaling**|**FP32 Training Time (90E)**|
|:------:|:-----------------:|:-----------:|:-------------------------:|:--------------------------------:|:-------------------------------------:|:---------------------:|:--------------------------:|
| 1 | 856.52 img/s |373.21 img/s | 2.30x | 1.00x | ~39 hours | 1.00x | ~89 hours |
| 8 | 6635.90 img/s |2899.62 img/s| 2.29x | 7.75x | ~5 hours | 7.77x | ~12 hours |
| **GPUs** | **Mixed Precision** | **FP32** | **Mixed Precision Speedup** | **Mixed Precision Strong Scaling** | **Mixed Precision Training Time (90E)** | **FP32 Strong Scaling** | **FP32 Training Time (90E)** |
|:--------:|:-------------------:|:----------:|:---------------------------:|:----------------------------------:|:---------------------------------------:|:-----------------------:|:----------------------------:|
| 1 | 1180 img/s | 371 img/s | 3.17 x | 1.0 x | ~29 hours | 1.0 x | ~91 hours |
| 8 | 7608 img/s | 2851 img/s | 2.66 x | 6.44 x | ~5 hours | 7.66 x | ~12 hours |
##### Training performance: NVIDIA DGX-1 32GB (8x V100 32GB)
|**GPUs**|**Mixed Precision**| **FP32** |**Mixed Precision Speedup**|**Mixed Precision Strong Scaling**|**Mixed Precision Training Time (90E)**|**FP32 Strong Scaling**|**FP32 Training Time (90E)**|
|:------:|:-----------------:|:-----------:|:-------------------------:|:--------------------------------:|:-------------------------------------:|:---------------------:|:--------------------------:|
| 1 | 816.00 img/s |359.76 img/s | 2.27x | 1.00x | ~41 hours | 1.00x | ~93 hours |
| 8 | 6347.26 img/s |2813.23 img/s| 2.26x | 7.78x | ~5 hours | 7.82x | ~12 hours |
| **GPUs** | **Mixed Precision** | **FP32** | **Mixed Precision Speedup** | **Mixed Precision Strong Scaling** | **Mixed Precision Training Time (90E)** | **FP32 Strong Scaling** | **FP32 Training Time (90E)** |
|:--------:|:-------------------:|:----------:|:---------------------------:|:----------------------------------:|:---------------------------------------:|:-----------------------:|:----------------------------:|
| 1 | 1115 img/s | 365 img/s | 3.04 x | 1.0 x | ~31 hours | 1.0 x | ~92 hours |
| 8 | 7375 img/s | 2811 img/s | 2.62 x | 6.61 x | ~5 hours | 7.68 x | ~12 hours |
#### Inference performance results
@ -638,66 +617,66 @@ The following images show a 250 epochs configuration on a DGX-1V.
###### FP32 Inference Latency
| **batch size** | **Throughput Avg** | **Latency Avg** | **Latency 90%** | **Latency 95%** | **Latency 99%** |
|:-:|:-:|:-:|:-:|:-:|:-:|
| 1 | 136.82 img/s | 7.12ms | 7.25ms | 8.36ms | 10.92ms |
| 2 | 266.86 img/s | 7.27ms | 7.41ms | 7.85ms | 9.11ms |
| 4 | 521.76 img/s | 7.44ms | 7.58ms | 8.14ms | 10.09ms |
| 8 | 766.22 img/s | 10.18ms | 10.46ms | 10.97ms | 12.75ms |
| 16 | 976.36 img/s | 15.79ms | 15.88ms | 15.95ms | 16.63ms |
| 32 | 1092.27 img/s | 28.63ms | 28.71ms | 28.76ms | 29.30ms |
| 64 | 1161.55 img/s | 53.69ms | 53.86ms | 53.90ms | 54.23ms |
| 128 | 1209.12 img/s | 104.24ms | 104.68ms | 104.80ms | 105.00ms |
| 256 | N/A | N/A | N/A | N/A | N/A |
| **Batch Size** | **Throughput Avg** | **Latency Avg** | **Latency 95%** | **Latency 99%** |
|:--------------:|:------------------:|:---------------:|:---------------:|:---------------:|
| 1 | 99 img/s | 10.38 ms | 11.24 ms | 12.32 ms |
| 2 | 190 img/s | 10.87 ms | 12.18 ms | 14.27 ms |
| 4 | 403 img/s | 10.26 ms | 11.02 ms | 13.28 ms |
| 8 | 754 img/s | 10.96 ms | 11.99 ms | 13.89 ms |
| 16 | 960 img/s | 17.16 ms | 16.74 ms | 18.18 ms |
| 32 | 1057 img/s | 31.39 ms | 30.4 ms | 30.55 ms |
| 64 | 1168 img/s | 57.1 ms | 55.01 ms | 56.19 ms |
| 112 | 1166 img/s | 100.78 ms | 95.98 ms | 97.43 ms |
| 128 | 1215 img/s | 111.11 ms | 105.52 ms | 106.38 ms |
| 256 | 1253 img/s | 217.03 ms | 203.78 ms | 208.68 ms |
###### Mixed Precision Inference Latency
| **batch size** | **Throughput Avg** | **Latency Avg** | **Latency 90%** | **Latency 95%** | **Latency 99%** |
|:-:|:-:|:-:|:-:|:-:|:-:|
| 1 | 114.97 img/s | 8.56ms | 9.32ms | 11.43ms | 12.79ms |
| 2 | 238.70 img/s | 8.20ms | 8.75ms | 9.49ms | 12.31ms |
| 4 | 448.69 img/s | 8.67ms | 9.20ms | 9.97ms | 10.60ms |
| 8 | 875.00 img/s | 8.88ms | 9.31ms | 9.80ms | 10.82ms |
| 16 | 1746.07 img/s | 8.89ms | 9.05ms | 9.56ms | 12.81ms |
| 32 | 2004.28 img/s | 14.07ms | 14.14ms | 14.31ms | 14.92ms |
| 64 | 2254.60 img/s | 25.93ms | 26.05ms | 26.07ms | 26.17ms |
| 128 | 2360.14 img/s | 50.14ms | 50.28ms | 50.34ms | 50.68ms |
| 256 | 2342.13 img/s | 96.74ms | 96.91ms | 96.99ms | 97.14ms |
| **Batch Size** | **Throughput Avg** | **Latency Avg** | **Latency 95%** | **Latency 99%** |
|:--------------:|:------------------:|:---------------:|:---------------:|:---------------:|
| 1 | 82 img/s | 12.43 ms | 13.29 ms | 14.89 ms |
| 2 | 157 img/s | 13.04 ms | 13.84 ms | 16.79 ms |
| 4 | 310 img/s | 13.26 ms | 14.42 ms | 15.63 ms |
| 8 | 646 img/s | 12.69 ms | 13.65 ms | 15.48 ms |
| 16 | 1188 img/s | 14.01 ms | 15.56 ms | 18.34 ms |
| 32 | 2093 img/s | 16.41 ms | 18.25 ms | 19.9 ms |
| 64 | 2899 img/s | 24.12 ms | 22.14 ms | 22.55 ms |
| 128 | 3142 img/s | 45.28 ms | 40.77 ms | 42.89 ms |
| 256 | 3276 img/s | 88.44 ms | 77.8 ms | 79.01 ms |
| 256 | 3276 img/s | 88.6 ms | 77.74 ms | 79.11 ms |
##### Inference performance: NVIDIA T4
###### FP32 Inference Latency
| **batch size** | **Throughput Avg** | **Latency Avg** | **Latency 90%** | **Latency 95%** | **Latency 99%** |
|:-:|:-:|:-:|:-:|:-:|:-:|
| 1 | 179.85 img/s | 5.51ms | 5.65ms | 7.34ms | 10.97ms |
| 2 | 348.12 img/s | 5.67ms | 5.95ms | 6.33ms | 9.81ms |
| 4 | 556.27 img/s | 7.03ms | 7.34ms | 8.13ms | 9.65ms |
| 8 | 740.43 img/s | 10.32ms | 10.33ms | 10.60ms | 13.87ms |
| 16 | 909.17 img/s | 17.19ms | 17.15ms | 18.13ms | 21.06ms |
| 32 | 999.07 img/s | 31.07ms | 31.12ms | 31.17ms | 32.41ms |
| 64 | 1090.47 img/s | 57.62ms | 57.84ms | 57.91ms | 58.05ms |
| 128 | 1142.46 img/s | 110.94ms | 111.15ms | 111.23ms | 112.16ms |
| 256 | N/A | N/A | N/A | N/A | N/A |
| **Batch Size** | **Throughput Avg** | **Latency Avg** | **Latency 95%** | **Latency 99%** |
|:--------------:|:------------------:|:---------------:|:---------------:|:---------------:|
| 1 | 147 img/s | 7.28 ms | 8.48 ms | 9.79 ms |
| 2 | 251 img/s | 8.48 ms | 10.23 ms | 14.01 ms |
| 4 | 303 img/s | 13.57 ms | 13.61 ms | 15.42 ms |
| 8 | 329 img/s | 24.7 ms | 24.74 ms | 25.0 ms |
| 16 | 371 img/s | 43.73 ms | 43.74 ms | 44.03 ms |
| 32 | 395 img/s | 82.36 ms | 82.13 ms | 82.58 ms |
| 64 | 421 img/s | 155.37 ms | 153.07 ms | 153.55 ms |
| 128 | 426 img/s | 309.06 ms | 303.0 ms | 307.42 ms |
| 256 | 419 img/s | 631.43 ms | 612.42 ms | 614.82 ms |
###### Mixed Precision Inference Latency
| **batch size** | **Throughput Avg** | **Latency Avg** | **Latency 90%** | **Latency 95%** | **Latency 99%** |
|:-:|:-:|:-:|:-:|:-:|:-:|
| 1 | 163.78 img/s | 6.05ms | 5.92ms | 7.98ms | 11.58ms |
| 2 | 333.43 img/s | 5.91ms | 6.05ms | 6.63ms | 11.52ms |
| 4 | 645.45 img/s | 6.04ms | 6.33ms | 7.01ms | 8.90ms |
| 8 | 1164.15 img/s | 6.73ms | 7.31ms | 8.04ms | 12.41ms |
| 16 | 1606.42 img/s | 9.53ms | 9.86ms | 10.52ms | 17.01ms |
| 32 | 1857.29 img/s | 15.67ms | 15.61ms | 16.14ms | 18.66ms |
| 64 | 2011.62 img/s | 28.64ms | 28.69ms | 28.82ms | 31.06ms |
| 128 | 2083.90 img/s | 54.87ms | 54.96ms | 54.99ms | 55.27ms |
| 256 | 2043.72 img/s | 106.51ms | 106.62ms | 106.68ms | 107.03ms |
| **Batch Size** | **Throughput Avg** | **Latency Avg** | **Latency 95%** | **Latency 99%** |
|:--------------:|:------------------:|:---------------:|:---------------:|:---------------:|
| 1 | 112 img/s | 9.25 ms | 9.87 ms | 10.62 ms |
| 2 | 223 img/s | 9.4 ms | 10.62 ms | 13.9 ms |
| 4 | 468 img/s | 9.06 ms | 11.15 ms | 15.5 ms |
| 8 | 844 img/s | 10.05 ms | 12.67 ms | 17.86 ms |
| 16 | 1037 img/s | 16.01 ms | 15.66 ms | 15.86 ms |
| 32 | 1103 img/s | 30.27 ms | 29.45 ms | 29.74 ms |
| 64 | 1154 img/s | 57.96 ms | 56.33 ms | 56.96 ms |
| 128 | 1177 img/s | 114.95 ms | 110.4 ms | 111.1 ms |
| 256 | 1184 img/s | 229.61 ms | 217.84 ms | 224.75 ms |
## Release notes
@ -720,9 +699,9 @@ The following images show a 250 epochs configuration on a DGX-1V.
5. July 2020
* Added A100 scripts
* Updated README
6. February 2021
* Moved from APEX AMP to Native AMP
### Known issues
There are no known issues with this model.

View file

@ -0,0 +1 @@
python ./launch.py --model resnet50 --precision AMP --mode convergence --platform DGX1V /imagenet --workspace ${1:-./} --raport-file raport.json

View file

@ -0,0 +1 @@
python ./launch.py --model resnet50 --precision AMP --mode convergence --platform DGX1V /imagenet --epochs 90 --mixup 0.0 --workspace ${1:-./} --raport-file raport.json

View file

@ -1 +0,0 @@
python ./multiproc.py --nproc_per_node 8 ./main.py /imagenet --data-backend dali-cpu --raport-file raport.json -j8 -p 100 --lr 2.048 --optimizer-batch-size 2048 --warmup 8 --arch resnet50 -c fanin --label-smoothing 0.1 --lr-schedule cosine --mom 0.875 --wd 3.0517578125e-05 --workspace ${1:-./} -b 256 --amp --static-loss-scale 128 --epochs 250 --mixup 0.2

View file

@ -1 +0,0 @@
python ./multiproc.py --nproc_per_node 8 ./main.py /imagenet --data-backend dali-cpu --raport-file raport.json -j8 -p 100 --lr 2.048 --optimizer-batch-size 2048 --warmup 8 --arch resnet50 -c fanin --label-smoothing 0.1 --lr-schedule cosine --mom 0.875 --wd 3.0517578125e-05 --workspace ${1:-./} -b 256 --amp --static-loss-scale 128 --epochs 50

View file

@ -1 +0,0 @@
python ./multiproc.py --nproc_per_node 8 ./main.py /imagenet --data-backend dali-cpu --raport-file raport.json -j8 -p 100 --lr 2.048 --optimizer-batch-size 2048 --warmup 8 --arch resnet50 -c fanin --label-smoothing 0.1 --lr-schedule cosine --mom 0.875 --wd 3.0517578125e-05 --workspace ${1:-./} -b 256 --amp --static-loss-scale 128 --epochs 90

View file

@ -0,0 +1 @@
python ./launch.py --model resnet50 --precision AMP --mode convergence --platform DGX2V /imagenet --workspace ${1:-./} --raport-file raport.json

View file

@ -0,0 +1 @@
python ./launch.py --model resnet50 --precision AMP --mode convergence --platform DGX2V /imagenet --epochs 90 --mixup 0.0 --workspace ${1:-./} --raport-file raport.json

View file

@ -1 +0,0 @@
python ./multiproc.py --nproc_per_node 16 ./main.py /imagenet --data-backend dali-gpu --raport-file raport.json -j8 -p 100 --lr 4.096 --optimizer-batch-size 4096 --warmup 16 --arch resnet50 -c fanin --label-smoothing 0.1 --lr-schedule cosine --mom 0.875 --wd 3.0517578125e-05 --workspace ${1:-./} -b 256 --amp --static-loss-scale 128 --epochs 250 --mixup 0.2

View file

@ -1 +0,0 @@
python ./multiproc.py --nproc_per_node 16 ./main.py /imagenet --data-backend dali-gpu --raport-file raport.json -j8 -p 100 --lr 4.096 --optimizer-batch-size 4096 --warmup 16 --arch resnet50 -c fanin --label-smoothing 0.1 --lr-schedule cosine --mom 0.875 --wd 3.0517578125e-05 --workspace ${1:-./} -b 256 --amp --static-loss-scale 128 --epochs 50

View file

@ -1 +0,0 @@
python ./multiproc.py --nproc_per_node 16 ./main.py /imagenet --data-backend dali-gpu --raport-file raport.json -j8 -p 100 --lr 4.096 --optimizer-batch-size 4096 --warmup 16 --arch resnet50 -c fanin --label-smoothing 0.1 --lr-schedule cosine --mom 0.875 --wd 3.0517578125e-05 --workspace ${1:-./} -b 256 --amp --static-loss-scale 128 --epochs 90

View file

@ -1 +0,0 @@
python ./multiproc.py --nproc_per_node 8 ./main.py /imagenet --data-backend dali-cpu --raport-file raport.json -j16 -p 100 --lr 2.048 --optimizer-batch-size 2048 --warmup 8 --arch resnet50 -c fanin --label-smoothing 0.1 --lr-schedule cosine --mom 0.875 --wd 3.0517578125e-05 --workspace ${1:-./} -b 256 --amp --static-loss-scale 128 --epochs 90

View file

@ -0,0 +1 @@
python ./launch.py --model resnet50 --precision AMP --mode convergence --platform DGXA100 /imagenet --workspace ${1:-./} --raport-file raport.json

View file

@ -0,0 +1 @@
python ./launch.py --model resnet50 --precision AMP --mode convergence --platform DGXA100 /imagenet --epochs 90 --mixup 0.0 --workspace ${1:-./} --raport-file raport.json

View file

@ -0,0 +1 @@
python ./launch.py --model resnet50 --precision FP32 --mode convergence --platform DGX1V /imagenet --workspace ${1:-./} --raport-file raport.json

View file

@ -0,0 +1 @@
python ./launch.py --model resnet50 --precision FP32 --mode convergence --platform DGX1V /imagenet --epochs 90 --mixup 0.0 --workspace ${1:-./} --raport-file raport.json

View file

@ -1 +0,0 @@
python ./multiproc.py --nproc_per_node 8 ./main.py /imagenet --data-backend dali-cpu --raport-file raport.json -j8 -p 100 --lr 2.048 --optimizer-batch-size 2048 --warmup 8 --arch resnet50 -c fanin --label-smoothing 0.1 --lr-schedule cosine --mom 0.875 --wd 3.0517578125e-05 --workspace ${1:-./} -b 128 --epochs 250 --mixup 0.2

View file

@ -1 +0,0 @@
python ./multiproc.py --nproc_per_node 8 ./main.py /imagenet --data-backend dali-cpu --raport-file raport.json -j8 -p 100 --lr 2.048 --optimizer-batch-size 2048 --warmup 8 --arch resnet50 -c fanin --label-smoothing 0.1 --lr-schedule cosine --mom 0.875 --wd 3.0517578125e-05 --workspace ${1:-./} -b 128 --epochs 50

View file

@ -1 +0,0 @@
python ./multiproc.py --nproc_per_node 8 ./main.py /imagenet --data-backend dali-cpu --raport-file raport.json -j8 -p 100 --lr 2.048 --optimizer-batch-size 2048 --warmup 8 --arch resnet50 -c fanin --label-smoothing 0.1 --lr-schedule cosine --mom 0.875 --wd 3.0517578125e-05 --workspace ${1:-./} -b 128 --epochs 90

View file

@ -0,0 +1 @@
python ./launch.py --model resnet50 --precision FP32 --mode convergence --platform DGX2V /imagenet --workspace ${1:-./} --raport-file raport.json

View file

@ -0,0 +1 @@
python ./launch.py --model resnet50 --precision FP32 --mode convergence --platform DGX2V /imagenet --epochs 90 --mixup 0.0 --workspace ${1:-./} --raport-file raport.json

View file

@ -1 +0,0 @@
python ./multiproc.py --nproc_per_node 16 ./main.py /imagenet --data-backend dali-gpu --raport-file raport.json -j8 -p 100 --lr 4.096 --optimizer-batch-size 4096 --warmup 16 --arch resnet50 -c fanin --label-smoothing 0.1 --lr-schedule cosine --mom 0.875 --wd 3.0517578125e-05 --workspace ${1:-./} -b 128 --epochs 250 --mixup 0.2

View file

@ -1 +0,0 @@
python ./multiproc.py --nproc_per_node 16 ./main.py /imagenet --data-backend dali-gpu --raport-file raport.json -j8 -p 100 --lr 4.096 --optimizer-batch-size 4096 --warmup 16 --arch resnet50 -c fanin --label-smoothing 0.1 --lr-schedule cosine --mom 0.875 --wd 3.0517578125e-05 --workspace ${1:-./} -b 128 --epochs 50

View file

@ -1 +0,0 @@
python ./multiproc.py --nproc_per_node 16 ./main.py /imagenet --data-backend dali-gpu --raport-file raport.json -j8 -p 100 --lr 4.096 --optimizer-batch-size 4096 --warmup 16 --arch resnet50 -c fanin --label-smoothing 0.1 --lr-schedule cosine --mom 0.875 --wd 3.0517578125e-05 --workspace ${1:-./} -b 128 --epochs 90

View file

@ -1 +0,0 @@
python ./multiproc.py --nproc_per_node 8 ./main.py /imagenet --data-backend dali-cpu --raport-file raport.json -j16 -p 100 --lr 2.048 --optimizer-batch-size 2048 --warmup 8 --arch resnet50 -c fanin --label-smoothing 0.1 --lr-schedule cosine --mom 0.875 --wd 3.0517578125e-05 --workspace ${1:-./} -b 256 --epochs 90

View file

@ -0,0 +1 @@
python ./launch.py --model resnet50 --precision TF32 --mode convergence --platform DGXA100 /imagenet --workspace ${1:-./} --raport-file raport.json

View file

@ -0,0 +1 @@
python ./launch.py --model resnet50 --precision TF32 --mode convergence --platform DGXA100 /imagenet --epochs 90 --mixup 0.0 --workspace ${1:-./} --raport-file raport.json

View file

@ -31,11 +31,11 @@ achieve state-of-the-art accuracy, and is tested and maintained by NVIDIA.
* [Inference performance benchmark](#inference-performance-benchmark)
* [Results](#results)
* [Training accuracy results](#training-accuracy-results)
* [Training accuracy: NVIDIA DGX A100 (8x A100 40GB)](#training-accuracy-nvidia-dgx-a100-8x-a100-40gb)
* [Training accuracy: NVIDIA DGX A100 (8x A100 80GB)](#training-accuracy-nvidia-dgx-a100-8x-a100-80gb)
* [Training accuracy: NVIDIA DGX-1 (8x V100 16GB)](#training-accuracy-nvidia-dgx-1-8x-v100-16gb)
* [Example plots](#example-plots)
* [Training performance results](#training-performance-results)
* [Training performance: NVIDIA DGX A100 (8x A100 40GB)](#training-performance-nvidia-dgx-a100-8x-a100-40gb)
* [Training performance: NVIDIA DGX A100 (8x A100 80GB)](#training-performance-nvidia-dgx-a100-8x-a100-80gb)
* [Training performance: NVIDIA DGX-1 16GB (8x V100 16GB)](#training-performance-nvidia-dgx-1-16gb-8x-v100-16gb)
* [Training performance: NVIDIA DGX-1 32GB (8x V100 32GB)](#training-performance-nvidia-dgx-1-32gb-8x-v100-32gb)
* [Inference performance results](#inference-performance-results)
@ -111,7 +111,7 @@ The following features are supported by this model:
| Feature | ResNeXt101-32x4d
|-----------------------|--------------------------
|[DALI](https://docs.nvidia.com/deeplearning/dali/release-notes/index.html) | Yes
|[DALI](https://docs.nvidia.com/deeplearning/sdk/dali-release-notes/index.html) | Yes
|[APEX AMP](https://nvidia.github.io/apex/amp.html) | Yes |
#### Features
@ -128,11 +128,11 @@ which speeds up data loading when CPU becomes a bottleneck.
DALI can use CPU or GPU, and outperforms the PyTorch native dataloader.
Run training with `--data-backend dali-gpu` or `--data-backend dali-cpu` to enable DALI.
For ResNeXt101-32x4d, for DGXA100, DGX1 and DGX2 we recommend `--data-backends dali-cpu`.
For DGXA100 and DGX1 we recommend `--data-backend dali-cpu`.
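For illustration only (batch size and the remaining hyperparameters are placeholders), a single-node run using the DALI CPU pipeline could look like `python ./multiproc.py --nproc_per_node 8 ./main.py /imagenet --data-backend dali-cpu --arch resnext101-32x4d -b 128 --amp ...`.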
### Mixed precision training
Mixed precision is the combined use of different numerical precisions in a computational method. [Mixed precision](https://arxiv.org/abs/1710.03740) training offers significant computational speedup by performing operations in half-precision format, while storing minimal information in single-precision to retain as much information as possible in critical parts of the network. Since the introduction of [Tensor Cores](https://developer.nvidia.com/tensor-cores) in the Volta and Turing architecture, significant training speedups are experienced by switching to mixed precision -- up to 3x overall speedup on the most arithmetically intense model architectures. Using mixed precision training requires two steps:
Mixed precision is the combined use of different numerical precisions in a computational method. [Mixed precision](https://arxiv.org/abs/1710.03740) training offers significant computational speedup by performing operations in half-precision format, while storing minimal information in single-precision to retain as much information as possible in critical parts of the network. Since the introduction of [Tensor Cores](https://developer.nvidia.com/tensor-cores) in Volta, and following with both the Turing and Ampere architectures, significant training speedups are experienced by switching to mixed precision -- up to 3x overall speedup on the most arithmetically intense model architectures. Using mixed precision training requires two steps:
1. Porting the model to use the FP16 data type where appropriate.
2. Adding loss scaling to preserve small gradient values.
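Both steps are handled by PyTorch native AMP (`torch.cuda.amp`), which these models now use instead of APEX AMP (see the changelog). The following is a minimal, generic sketch with a placeholder model and random data, not the training loop from this repository:

```python
import torch
import torch.nn as nn

# Minimal, self-contained sketch of the two steps above using native PyTorch AMP
# (torch.cuda.amp). The tiny model and random data are placeholders, not the
# models or data pipeline from this repository.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 1000)).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = torch.cuda.amp.GradScaler()           # loss scaling (step 2)

for _ in range(10):
    images = torch.randn(8, 3, 224, 224, device="cuda")
    targets = torch.randint(0, 1000, (8,), device="cuda")
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():            # forward pass in mixed precision (step 1)
        loss = nn.functional.cross_entropy(model(images), targets)
    scaler.scale(loss).backward()              # scale the loss before backpropagation
    scaler.step(optimizer)                     # unscale gradients, then update weights
    scaler.update()                            # adapt the loss scale for the next step
```

In this repository the same mechanism is enabled with the `--amp` flag, optionally together with `--static-loss-scale` or `--dynamic-loss-scale`.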
@ -190,7 +190,7 @@ The following section lists the requirements that you need to meet in order to s
This repository contains Dockerfile which extends the PyTorch NGC container and encapsulates some dependencies. Aside from these dependencies, ensure you have the following components:
* [NVIDIA Docker](https://github.com/NVIDIA/nvidia-docker)
* [PyTorch 20.06-py3 NGC container](https://ngc.nvidia.com/registry/nvidia-pytorch) or newer
* [PyTorch 20.12-py3 NGC container](https://ngc.nvidia.com/registry/nvidia-pytorch) or newer
* Supported GPUs:
* [NVIDIA Volta architecture](https://www.nvidia.com/en-us/data-center/volta-gpu-architecture/)
* [NVIDIA Turing architecture](https://www.nvidia.com/en-us/geforce/turing/)
@ -242,27 +242,27 @@ For the specifics concerning training and inference, see the [Advanced](#advance
The directory in which the `train/` and `val/` directories are placed, is referred to as `<path to imagenet>` in this document.
### 3. Build the RNXT101-32x4d PyTorch NGC container.
### 3. Build the ResNeXt101-32x4d PyTorch NGC container.
```
docker build . -t nvidia_rnxt101-32x4d
docker build . -t nvidia_resnext101-32x4d
```
### 4. Start an interactive session in the NGC container to run training/inference.
```
nvidia-docker run --rm -it -v <path to imagenet>:/imagenet --ipc=host nvidia_rnxt101-32x4d
nvidia-docker run --rm -it -v <path to imagenet>:/imagenet --ipc=host nvidia_resnext101-32x4d
```
### 5. Start training
To run training for a standard configuration (DGXA100/DGX1/DGX2, AMP/TF32/FP32, 90/250 Epochs),
To run training for a standard configuration (DGXA100/DGX1V, AMP/TF32/FP32, 90/250 Epochs),
run one of the scripts in the `./resnext101-32x4d/training` directory
called `./resnext101-32x4d/training/{AMP, TF32, FP32}/{DGXA100, DGX1, DGX2}_RNXT101-32x4d_{AMP, TF32, FP32}_{90,250}E.sh`.
called `./resnext101-32x4d/training/{AMP, TF32, FP32}/{DGXA100, DGX1V}_resnext101-32x4d_{AMP, TF32, FP32}_{90, 250}E.sh`.
Ensure ImageNet is mounted in the `/imagenet` directory.
Example:
`bash ./resnext101-32x4d/training/AMP/DGX1_RNXT101-32x4d_AMP_250E.sh <path were to store checkpoints and logs>`
`bash ./resnext101-32x4d/training/AMP/DGX1_resnext101-32x4d_AMP_250E.sh <path where to store checkpoints and logs>`
### 6. Start inference
@ -280,7 +280,7 @@ To run inference on ImageNet, run:
To run inference on JPEG image using pretrained weights:
`python classify.py --arch resnext101-32x4d -c fanin --weights nvidia_resnext101-32x4d_200821.pth.tar --precision AMP|FP32 --image <path to JPEG image>`
`python classify.py --arch resnext101-32x4d -c fanin --weights nvidia_resnext101-32x4d_200821.pth.tar --precision AMP|FP32 --image <path to JPEG image>`
## Advanced
@ -319,7 +319,7 @@ usage: main.py [-h] [--data-backend BACKEND] [--arch ARCH]
[--lr-schedule SCHEDULE] [--warmup E] [--label-smoothing S]
[--mixup ALPHA] [--momentum M] [--weight-decay W]
[--bn-weight-decay] [--nesterov] [--print-freq N]
[--resume PATH] [--pretrained-weights PATH] [--fp16]
[--resume PATH] [--pretrained-weights PATH]
[--static-loss-scale STATIC_LOSS_SCALE] [--dynamic-loss-scale]
[--prof N] [--amp] [--seed SEED] [--gather-checkpoints]
[--raport-file RAPORT_FILE] [--evaluate] [--training-only]
@ -338,8 +338,10 @@ optional arguments:
data backend: pytorch | syntetic | dali-gpu | dali-cpu
(default: dali-cpu)
--arch ARCH, -a ARCH model architecture: resnet18 | resnet34 | resnet50 |
resnet101 | resnet152 | resnext101-32x4d | se-
resnext101-32x4d (default: resnet50)
resnet101 | resnet152 | resnext50-32x4d |
resnext101-32x4d | resnext101-32x8d |
resnext101-32x8d-basic | se-resnext101-32x4d (default:
resnet50)
--model-config CONF, -c CONF
model configs: classic | fanin | grp-fanin | grp-
fanout(default: classic)
@ -368,10 +370,9 @@ optional arguments:
--resume PATH path to latest checkpoint (default: none)
--pretrained-weights PATH
load weights from here
--fp16 Run model fp16 mode.
--static-loss-scale STATIC_LOSS_SCALE
Static loss scale, positive power of 2 values can
improve fp16 convergence.
improve amp convergence.
--dynamic-loss-scale Use dynamic loss scaling. If supplied, this argument
supersedes --static-loss-scale.
--prof N Run only N iterations
@ -399,25 +400,7 @@ To use your own dataset, divide it in directories as in the following scheme:
- Training images - `train/<class id>/<image>`
- Validation images - `val/<class id>/<image>`
If your dataset's has number of classes different than 1000, you need to add a custom config
in the `image_classification/resnet.py` file.
```python
resnet_versions = {
...
'resnext101-32x4d-custom' : {
'net' : ResNet,
'block' : Bottleneck,
'cardinality' : 32,
'layers' : [3, 4, 23, 3],
'widths' : [128, 256, 512, 1024],
'expansion' : 2,
'num_classes' : <custom number of classes>,
}
}
```
After adding the config, run the training script with `--arch resnext101-32x4d-custom` flag.
If your dataset has a number of classes different from 1000, you need to pass the `--num-classes N` flag to the training script.
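For example, for a hypothetical dataset with 200 classes laid out as above, the flag can be appended to the training command, e.g. `python ./main.py <path to your dataset> --arch resnext101-32x4d --num-classes 200 ...` (remaining hyperparameters omitted).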
### Training process
@ -440,7 +423,7 @@ To restart training from checkpoint use `--resume` option.
To start training from pretrained weights (e.g. downloaded from NGC), use the `--pretrained-weights` option.
The difference between those two is that the pretrained weights contain only model weights,
and checkpoints, apart from model weights, contain optimizer state, LR scheduler state, RNG state.
and checkpoints, apart from model weights, also contain the optimizer state and LR scheduler state.
Checkpoints are suitable for dividing the training into parts, for example in order
to divide the training job into shorter stages, or to restart training after an infrastructure failure.
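For example (paths are placeholders), `python ./main.py /imagenet --arch resnext101-32x4d --resume <path to checkpoint> ...` continues an interrupted run from a saved checkpoint, while `python ./main.py /imagenet --arch resnext101-32x4d --pretrained-weights nvidia_resnext101-32x4d_200821.pth.tar ...` starts a fresh run initialized from the downloaded weights.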
@ -482,9 +465,9 @@ You can also run ImageNet validation on pretrained weights:
Pretrained weights can be downloaded from NGC:
```bash
wget --content-disposition https://api.ngc.nvidia.com/v2/models/nvidia/resnext101-32x4d_pyt_amp/versions/20.06.0/zip -O resnext101-32x4d_pyt_amp_20.06.0.zip
wget --content-disposition https://api.ngc.nvidia.com/v2/models/nvidia/resnext101_32x4d_pyt_amp/versions/20.06.0/zip -O resnext101_32x4d_pyt_amp_20.06.0.zip
unzip resnext101-32x4d_pyt_amp_20.06.0.zip
unzip resnext101_32x4d_pyt_amp_20.06.0.zip
```
To run inference on ImageNet, run:
@ -493,7 +476,7 @@ To run inference on ImageNet, run:
To run inference on JPEG image using pretrained weights:
`python classify.py --arch resnext101-32x4d -c fanin --weights nvidia_resnext101-32x4d_200821.pth.tar --precision AMP|FP32 --image <path to JPEG image>`
`python classify.py --arch resnext101-32x4d --weights nvidia_resnext101-32x4d_200821.pth.tar --precision AMP|FP32 --image <path to JPEG image>`
## Performance
@ -507,71 +490,62 @@ The following section shows how to run benchmarks measuring the model performanc
To benchmark training, run:
* For 1 GPU
* FP32
`python ./main.py --arch resnext101-32x4d -b <batch_size> --training-only -p 1 --raport-file benchmark.json --epochs 1 --prof 100 <path to imagenet>`
* FP32 (V100 GPUs only)
`python ./launch.py --model resnext101-32x4d --precision FP32 --mode benchmark_training --platform DGX1V <path to imagenet> --raport-file benchmark.json --epochs 1 --prof 100`
* TF32 (A100 GPUs only)
`python ./launch.py --model resnext101-32x4d --precision TF32 --mode benchmark_training --platform DGXA100 <path to imagenet> --raport-file benchmark.json --epochs 1 --prof 100`
* AMP
`python ./main.py --arch resnext101-32x4d -b <batch_size> --training-only -p 1 --raport-file benchmark.json --epochs 1 --prof 100 --amp --static-loss-scale 256 --memory-format nhwc <path to imagenet>`
`python ./launch.py --model resnext101-32x4d --precision AMP --mode benchmark_training --platform <DGX1V|DGXA100> <path to imagenet> --raport-file benchmark.json --epochs 1 --prof 100`
* For multiple GPUs
* FP32
`python ./multiproc.py --nproc_per_node 8 ./main.py --arch resnext101-32x4d -b <batch_size> --training-only -p 1 --raport-file benchmark.json --epochs 1 --prof 100 <path to imagenet>`
* FP32 (V100 GPUs only)
`python ./launch.py --model resnext101-32x4d --precision FP32 --mode benchmark_training --platform DGX1V <path to imagenet> --raport-file benchmark.json --epochs 1 --prof 100`
* TF32 (A100 GPUs only)
`python ./multiproc.py --nproc_per_node 8 ./launch.py --model resnext101-32x4d --precision TF32 --mode benchmark_training --platform DGXA100 <path to imagenet> --raport-file benchmark.json --epochs 1 --prof 100`
* AMP
`python ./multiproc.py --nproc_per_node 8 ./main.py --arch resnext101-32x4d -b <batch_size> --training-only -p 1 --raport-file benchmark.json --amp --static-loss-scale 256 --epochs 1 --prof 100 --memory-format nhwc <path to imagenet>`
`python ./multiproc.py --nproc_per_node 8 ./launch.py --model resnext101-32x4d --precision AMP --mode benchmark_training --platform <DGX1V|DGXA100> <path to imagenet> --raport-file benchmark.json --epochs 1 --prof 100`
Each of these scripts will run 100 iterations and save results in the `benchmark.json` file.
Batch size should be picked appropriately depending on the hardware configuration.
| *Platform* | *Precision* | *Batch Size* |
|:----------:|:-----------:|:------------:|
| DGXA100 | AMP | 128 |
| DGXA100 | TF32 | 128 |
| DGX-1 | AMP | 128 |
| DGX-1 | FP32 | 64 |
#### Inference performance benchmark
To benchmark inference, run:
* FP32
* FP32 (V100 GPUs only)
`python ./main.py --arch resnext101-32x4d -b <batch_size> -p 1 --raport-file benchmark.json --epochs 1 --prof 100 --evaluate <path to imagenet>`
`python ./launch.py --model resnext101-32x4d --precision FP32 --mode benchmark_inference --platform DGX1V <path to imagenet> --raport-file benchmark.json --epochs 1 --prof 100`
* TF32 (A100 GPUs only)
`python ./launch.py --model resnext101-32x4d --precision FP32 --mode benchmark_inference --platform DGXA100 <path to imagenet> --raport-file benchmark.json --epochs 1 --prof 100`
* AMP
`python ./main.py --arch resnext101-32x4d -b <batch_size> -p 1 --raport-file benchmark.json --epochs 1 --prof 100 --evaluate --amp --memory-format nhwc <path to imagenet>`
`python ./launch.py --model resnext101-32x4d --precision AMP --mode benchmark_inference --platform <DGX1V|DGXA100> <path to imagenet> --raport-file benchmark.json --epochs 1 --prof 100`
Each of these scripts will run 100 iterations and save results in the `benchmark.json` file.
Batch size should be picked appropriately depending on the hardware configuration.
| *Platform* | *Precision* | *Batch Size* |
|:----------:|:-----------:|:------------:|
| DGXA100 | AMP | 128 |
| DGXA100 | TF32 | 128 |
| DGX-1 | AMP | 128 |
| DGX-1 | FP32 | 64 |
### Results
Our results were obtained by running the applicable training script in the pytorch-20.06 NGC container.
Our results were obtained by running the applicable training script in the pytorch-20.12 NGC container.
To achieve these same results, follow the steps in the [Quick Start Guide](#quick-start-guide).
#### Training accuracy results
##### Training accuracy: NVIDIA DGX A100 (8x A100 40GB)
##### Training accuracy: NVIDIA DGX A100 (8x A100 80GB)
| **Epochs** | **Mixed Precision Top1** | **TF32 Top1** |
|:----------:|:------------------------:|:--------------:|
| 90 | 79.47 +/- 0.03 | 79.38 +/- 0.07 |
| 250 | 80.19 +/- 0.08 | 80.27 +/- 0.1 |
| **epochs** | **Mixed Precision Top1** | **TF32 Top1** |
|:------:|:--------------------:|:--------------:|
| 90 | 79.37 +/- 0.13 | 79.38 +/- 0.13 |
##### Training accuracy: NVIDIA DGX-1 (8x V100 16GB)
| **epochs** | **Mixed Precision Top1** | **FP32 Top1** |
|:-:|:-:|:-:|
| 90 | 79.43 +/- 0.04 | 79.40 +/- 0.10 |
| 250 | 79.92 +/- 0.13 | 80.06 +/- 0.06 |
| **Epochs** | **Mixed Precision Top1** | **FP32 Top1** |
|:----------:|:------------------------:|:--------------:|
| 90 | 79.49 +/- 0.05 | 79.40 +/- 0.10 |
| 250 | 80.26 +/- 0.11 | 80.06 +/- 0.06 |
##### Example plots
@ -586,26 +560,29 @@ The following images show a 250 epochs configuration on a DGX-1V.
#### Training performance results
##### Training performance: NVIDIA DGX A100 (8x A100 40GB)
##### Training performance: NVIDIA DGX A100 (8x A100 80GB)
| **GPUs** | **Mixed Precision** | **TF32** | **Mixed Precision Speedup** | **Mixed Precision Strong Scaling** | **Mixed Precision Training Time (90E)** | **TF32 Strong Scaling** | **TF32 Training Time (90E)** |
|:--------:|:-------------------:|:----------:|:---------------------------:|:----------------------------------:|:---------------------------------------:|:-----------------------:|:----------------------------:|
| 1 | 1169 img/s | 420 img/s | 2.77 x | 1.0 x | ~29 hours | 1.0 x | ~80 hours |
| 8 | 7399 img/s | 3193 img/s | 2.31 x | 6.32 x | ~5 hours | 7.58 x | ~11 hours |
|**GPUs**|**Mixed Precision**| **TF32** |**Mixed Precision Speedup**|**Mixed Precision Strong Scaling**|**Mixed Precision Training Time (90E)**|**TF32 Strong Scaling**|**TF32 Training Time (90E)**|
|:------:|:-----------------:|:-----------:|:-------------------------:|:--------------------------------:|:-------------------------------------:|:---------------------:|:--------------------------:|
| 1 | 908.40 img/s |300.42 img/s | 3.02x | 1.00x | ~37 hours | 1.00x | ~111 hours |
| 8 | 6887.59 img/s |2380.51 img/s| 2.89x | 7.58x | ~5 hours | 7.92x | ~14 hours |
##### Training performance: NVIDIA DGX-1 16GB (8x V100 16GB)
|**GPUs**|**Mixed Precision**| **FP32** |**Mixed Precision Speedup**|**Mixed Precision Strong Scaling**|**Mixed Precision Training Time (90E)**|**FP32 Strong Scaling**|**FP32 Training Time (90E)**|
|:------:|:-----------------:|:-----------:|:-------------------------:|:--------------------------------:|:-------------------------------------:|:---------------------:|:--------------------------:|
| 1 | 534.91 img/s |150.05 img/s | 3.56x | 1.00x | ~62 hours | 1.00x | ~222 hours |
| 8 | 4000.79 img/s |1151.01 img/s| 3.48x | 7.48x | ~9 hours | 7.67x | ~29 hours |
| **GPUs** | **Mixed Precision** | **FP32** | **Mixed Precision Speedup** | **Mixed Precision Strong Scaling** | **Mixed Precision Training Time (90E)** | **FP32 Strong Scaling** | **FP32 Training Time (90E)** |
|:--------:|:-------------------:|:----------:|:---------------------------:|:----------------------------------:|:---------------------------------------:|:-----------------------:|:----------------------------:|
| 1 | 578 img/s | 149 img/s | 3.86 x | 1.0 x | ~59 hours | 1.0 x | ~225 hours |
| 8 | 3742 img/s | 1117 img/s | 3.34 x | 6.46 x | ~9 hours | 7.45 x | ~31 hours |
##### Training performance: NVIDIA DGX-1 32GB (8x V100 32GB)
|**GPUs**|**Mixed Precision**| **FP32** |**Mixed Precision Speedup**|**Mixed Precision Strong Scaling**|**Mixed Precision Training Time (90E)**|**FP32 Strong Scaling**|**FP32 Training Time (90E)**|
|:------:|:-----------------:|:-----------:|:-------------------------:|:--------------------------------:|:-------------------------------------:|:---------------------:|:--------------------------:|
| 1 | 516.07 img/s |139.80 img/s | 3.69x | 1.00x | ~65 hours | 1.00x | ~238 hours |
| 8 | 3861.95 img/s |1070.94 img/s| 3.61x | 7.48x | ~9 hours | 7.66x | ~31 hours |
| **GPUs** | **Mixed Precision** | **FP32** | **Mixed Precision Speedup** | **Mixed Precision Strong Scaling** | **Mixed Precision Training Time (90E)** | **FP32 Strong Scaling** | **FP32 Training Time (90E)** |
|:--------:|:-------------------:|:----------:|:---------------------------:|:----------------------------------:|:---------------------------------------:|:-----------------------:|:----------------------------:|
| 1 | 556 img/s | 151 img/s | 3.68 x | 1.0 x | ~61 hours | 1.0 x | ~223 hours |
| 8 | 3595 img/s | 1102 img/s | 3.26 x | 6.45 x | ~10 hours | 7.28 x | ~31 hours |
#### Inference performance results
@ -613,62 +590,64 @@ The following images show a 250 epochs configuration on a DGX-1V.
###### FP32 Inference Latency
| **batch size** | **Throughput Avg** | **Latency Avg** | **Latency 90%** | **Latency 95%** | **Latency 99%** |
|:-:|:-:|:-:|:-:|:-:|:-:|
| 1 | 47.34 img/s | 21.02ms | 23.41ms | 24.55ms | 26.00ms |
| 2 | 89.68 img/s | 22.14ms | 22.90ms | 24.86ms | 26.59ms |
| 4 | 175.92 img/s | 22.57ms | 24.96ms | 25.53ms | 26.03ms |
| 8 | 325.69 img/s | 24.35ms | 25.17ms | 25.80ms | 28.52ms |
| 16 | 397.04 img/s | 40.04ms | 40.01ms | 40.08ms | 40.32ms |
| 32 | 431.77 img/s | 73.71ms | 74.05ms | 74.09ms | 74.26ms |
| 64 | 485.70 img/s | 131.04ms | 131.38ms | 131.53ms | 131.81ms |
| 128 | N/A | N/A | N/A | N/A | N/A |
| **Batch Size** | **Throughput Avg** | **Latency Avg** | **Latency 95%** | **Latency 99%** |
|:--------------:|:------------------:|:---------------:|:---------------:|:---------------:|
| 1 | 55 img/s | 18.48 ms | 18.88 ms | 20.74 ms |
| 2 | 116 img/s | 17.54 ms | 18.15 ms | 21.32 ms |
| 4 | 214 img/s | 19.07 ms | 20.44 ms | 22.69 ms |
| 8 | 291 img/s | 27.8 ms | 27.99 ms | 28.47 ms |
| 16 | 354 img/s | 45.78 ms | 45.4 ms | 45.73 ms |
| 32 | 423 img/s | 77.13 ms | 75.96 ms | 76.21 ms |
| 64 | 486 img/s | 134.92 ms | 132.17 ms | 132.51 ms |
| 128 | 523 img/s | 252.11 ms | 244.5 ms | 244.99 ms |
| 256 | 530 img/s | 499.64 ms | 479.83 ms | 481.41 ms |
###### Mixed Precision Inference Latency
| **batch size** | **Throughput Avg** | **Latency Avg** | **Latency 90%** | **Latency 95%** | **Latency 99%** |
|:-:|:-:|:-:|:-:|:-:|:-:|
| 1 | 43.11 img/s | 23.05ms | 25.19ms | 25.41ms | 26.63ms |
| 2 | 83.29 img/s | 23.82ms | 25.11ms | 26.25ms | 27.29ms |
| 4 | 173.67 img/s | 22.82ms | 24.38ms | 25.26ms | 25.92ms |
| 8 | 330.18 img/s | 24.05ms | 26.45ms | 27.37ms | 27.74ms |
| 16 | 634.82 img/s | 25.00ms | 26.93ms | 28.12ms | 28.73ms |
| 32 | 884.91 img/s | 35.71ms | 35.96ms | 36.01ms | 36.13ms |
| 64 | 998.40 img/s | 63.43ms | 63.63ms | 63.75ms | 63.96ms |
| 128 | 1079.10 img/s | 117.74ms | 118.02ms | 118.11ms | 118.35ms |
| **Batch Size** | **Throughput Avg** | **Latency Avg** | **Latency 95%** | **Latency 99%** |
|:--------------:|:------------------:|:---------------:|:---------------:|:---------------:|
| 1 | 40 img/s | 25.17 ms | 28.4 ms | 30.66 ms |
| 2 | 89 img/s | 22.64 ms | 24.29 ms | 25.99 ms |
| 4 | 165 img/s | 24.54 ms | 26.23 ms | 28.61 ms |
| 8 | 334 img/s | 24.31 ms | 28.46 ms | 29.91 ms |
| 16 | 632 img/s | 25.8 ms | 27.76 ms | 29.53 ms |
| 32 | 1219 img/s | 27.35 ms | 29.86 ms | 31.6 ms |
| 64 | 1525 img/s | 43.97 ms | 42.01 ms | 42.96 ms |
| 128 | 1647 img/s | 82.22 ms | 77.65 ms | 79.56 ms |
| 256 | 1689 img/s | 161.53 ms | 151.25 ms | 152.01 ms |
##### Inference performance: NVIDIA T4
###### FP32 Inference Latency
| **batch size** | **Throughput Avg** | **Latency Avg** | **Latency 90%** | **Latency 95%** | **Latency 99%** |
|:-:|:-:|:-:|:-:|:-:|:-:|
| 1 | 55.64 img/s | 17.88ms | 19.21ms | 20.35ms | 22.29ms |
| 2 | 109.22 img/s | 18.24ms | 19.00ms | 20.43ms | 22.51ms |
| 4 | 217.27 img/s | 18.26ms | 18.88ms | 19.51ms | 21.74ms |
| 8 | 294.55 img/s | 26.74ms | 27.35ms | 27.62ms | 28.93ms |
| 16 | 351.30 img/s | 45.34ms | 45.72ms | 46.10ms | 47.43ms |
| 32 | 401.97 img/s | 79.10ms | 79.37ms | 79.44ms | 81.83ms |
| 64 | 449.30 img/s | 140.30ms | 140.73ms | 141.26ms | 143.57ms |
| 128 | N/A | N/A | N/A | N/A | N/A |
| **Batch Size** | **Throughput Avg** | **Latency Avg** | **Latency 95%** | **Latency 99%** |
|:--------------:|:------------------:|:---------------:|:---------------:|:---------------:|
| 1 | 79 img/s | 13.07 ms | 14.66 ms | 15.59 ms |
| 2 | 119 img/s | 17.21 ms | 18.07 ms | 19.78 ms |
| 4 | 141 img/s | 28.65 ms | 28.62 ms | 28.77 ms |
| 8 | 139 img/s | 57.84 ms | 58.29 ms | 58.62 ms |
| 16 | 153 img/s | 104.8 ms | 105.65 ms | 106.2 ms |
| 32 | 178 img/s | 181.24 ms | 180.96 ms | 181.57 ms |
| 64 | 179 img/s | 360.93 ms | 358.22 ms | 359.11 ms |
| 128 | 177 img/s | 735.99 ms | 726.15 ms | 727.81 ms |
| 256 | 167 img/s | 1561.91 ms | 1523.52 ms | 1525.96 ms |
###### Mixed Precision Inference Latency
| **batch size** | **Throughput Avg** | **Latency Avg** | **Latency 90%** | **Latency 95%** | **Latency 99%** |
|:-:|:-:|:-:|:-:|:-:|:-:|
| 1 | 51.14 img/s | 19.48ms | 20.16ms | 21.40ms | 26.21ms |
| 2 | 102.29 img/s | 19.44ms | 19.77ms | 20.42ms | 24.51ms |
| 4 | 209.44 img/s | 18.93ms | 19.52ms | 20.23ms | 21.95ms |
| 8 | 408.69 img/s | 19.47ms | 21.12ms | 23.15ms | 25.77ms |
| 16 | 641.78 img/s | 24.54ms | 25.19ms | 25.64ms | 27.31ms |
| 32 | 800.26 img/s | 39.28ms | 39.43ms | 39.54ms | 41.96ms |
| 64 | 883.66 img/s | 71.76ms | 71.87ms | 71.94ms | 72.78ms |
| 128 | 948.27 img/s | 134.19ms | 134.40ms | 134.58ms | 134.81ms |
| **Batch Size** | **Throughput Avg** | **Latency Avg** | **Latency 95%** | **Latency 99%** |
|:--------------:|:------------------:|:---------------:|:---------------:|:---------------:|
| 1 | 65 img/s | 15.69 ms | 16.95 ms | 17.97 ms |
| 2 | 126 img/s | 16.2 ms | 16.78 ms | 18.6 ms |
| 4 | 245 img/s | 16.77 ms | 18.35 ms | 25.88 ms |
| 8 | 488 img/s | 16.82 ms | 17.86 ms | 25.45 ms |
| 16 | 541 img/s | 30.16 ms | 29.95 ms | 30.18 ms |
| 32 | 566 img/s | 57.79 ms | 57.11 ms | 57.29 ms |
| 64 | 580 img/s | 112.84 ms | 111.07 ms | 111.56 ms |
| 128 | 586 img/s | 224.75 ms | 219.12 ms | 219.64 ms |
| 256 | 589 img/s | 447.25 ms | 434.18 ms | 439.22 ms |
## Release notes
@ -680,9 +659,10 @@ The following images show a 250 epochs configuration on a DGX-1V.
2. July 2020
* Added A100 scripts
* Updated README
3. February 2021
* Moved from APEX AMP to Native AMP
### Known issues
There are no known issues with this model.

View file

@ -0,0 +1 @@
python ./launch.py --model resnext101-32x4d --precision AMP --mode convergence --platform DGX1V /imagenet --workspace ${1:-./} --raport-file raport.json

View file

@ -0,0 +1 @@
python ./launch.py --model resnext101-32x4d --precision AMP --mode convergence --platform DGX1V /imagenet --epochs 90 --mixup 0.0 --workspace ${1:-./} --raport-file raport.json

View file

@ -1 +0,0 @@
python ./multiproc.py --nproc_per_node 8 ./main.py /imagenet --raport-file raport.json -j8 -p 100 --data-backend dali-cpu --arch resnext101-32x4d -c fanin --label-smoothing 0.1 --workspace $1 -b 128 --amp --static-loss-scale 128 --optimizer-batch-size 1024 --lr 1.024 --mom 0.875 --lr-schedule cosine --epochs 250 --warmup 8 --wd 6.103515625e-05 --mixup 0.2 --memory-format nhwc

View file

@ -1 +0,0 @@
python ./multiproc.py --nproc_per_node 8 ./main.py /imagenet --raport-file raport.json -j8 -p 100 --data-backend dali-cpu --arch resnext101-32x4d -c fanin --label-smoothing 0.1 --workspace $1 -b 128 --amp --static-loss-scale 128 --optimizer-batch-size 1024 --lr 1.024 --mom 0.875 --lr-schedule cosine --epochs 90 --warmup 8 --wd 6.103515625e-05 --memory-format nhwc

View file

@ -1 +0,0 @@
python ./multiproc.py --nproc_per_node 8 ./main.py /imagenet --raport-file raport.json -j16 -p 100 --data-backend dali-cpu --arch resnext101-32x4d -c fanin --label-smoothing 0.1 --workspace $1 -b 128 --amp --static-loss-scale 128 --optimizer-batch-size 1024 --lr 1.024 --mom 0.875 --lr-schedule cosine --epochs 90 --warmup 8 --wd 6.103515625e-05 --memory-format nhwc

View file

@ -0,0 +1 @@
python ./launch.py --model resnext101-32x4d --precision AMP --mode convergence --platform DGXA100 /imagenet --workspace ${1:-./} --raport-file raport.json

View file

@ -0,0 +1 @@
python ./launch.py --model resnext101-32x4d --precision AMP --mode convergence --platform DGXA100 /imagenet --epochs 90 --mixup 0.0 --workspace ${1:-./} --raport-file raport.json

View file

@ -0,0 +1 @@
python ./launch.py --model resnext101-32x4d --precision FP32 --mode convergence --platform DGX1V /imagenet --workspace ${1:-./} --raport-file raport.json

View file

@ -0,0 +1 @@
python ./launch.py --model resnext101-32x4d --precision FP32 --mode convergence --platform DGX1V /imagenet --epochs 90 --mixup 0.0 --workspace ${1:-./} --raport-file raport.json

View file

@ -1 +0,0 @@
python ./multiproc.py --nproc_per_node 8 ./main.py /imagenet --raport-file raport.json -j8 -p 100 --data-backend dali-cpu --arch resnext101-32x4d -c fanin --label-smoothing 0.1 --workspace $1 -b 64 --optimizer-batch-size 1024 --lr 1.024 --mom 0.875 --lr-schedule cosine --epochs 250 --warmup 8 --wd 6.103515625e-05 --mixup 0.2

View file

@ -1 +0,0 @@
python ./multiproc.py --nproc_per_node 8 ./main.py /imagenet --raport-file raport.json -j8 -p 100 --data-backend dali-cpu --arch resnext101-32x4d -c fanin --label-smoothing 0.1 --workspace $1 -b 64 --optimizer-batch-size 1024 --lr 1.024 --mom 0.875 --lr-schedule cosine --epochs 90 --warmup 8 --wd 6.103515625e-05

View file

@ -1 +0,0 @@
python ./multiproc.py --nproc_per_node 8 ./main.py /imagenet --raport-file raport.json -j16 -p 100 --data-backend dali-cpu --arch resnext101-32x4d -c fanin --label-smoothing 0.1 --workspace $1 -b 128 --optimizer-batch-size 1024 --lr 1.024 --mom 0.875 --lr-schedule cosine --epochs 90 --warmup 8 --wd 6.103515625e-05

View file

@ -0,0 +1 @@
python ./launch.py --model resnext101-32x4d --precision TF32 --mode convergence --platform DGXA100 /imagenet --workspace ${1:-./} --raport-file raport.json

View file

@ -0,0 +1 @@
python ./launch.py --model resnext101-32x4d --precision TF32 --mode convergence --platform DGXA100 /imagenet --epochs 90 --mixup 0.0 --workspace ${1:-./} --raport-file raport.json

View file

@ -31,11 +31,11 @@ achieve state-of-the-art accuracy, and is tested and maintained by NVIDIA.
* [Inference performance benchmark](#inference-performance-benchmark)
* [Results](#results)
* [Training accuracy results](#training-accuracy-results)
* [Training accuracy: NVIDIA DGX A100 (8x A100 40GB)](#training-accuracy-nvidia-dgx-a100-8x-a100-40gb)
* [Training accuracy: NVIDIA DGX A100 (8x A100 80GB)](#training-accuracy-nvidia-dgx-a100-8x-a100-80gb)
* [Training accuracy: NVIDIA DGX-1 (8x V100 16GB)](#training-accuracy-nvidia-dgx-1-8x-v100-16gb)
* [Example plots](#example-plots)
* [Training performance results](#training-performance-results)
* [Training performance: NVIDIA DGX A100 (8x A100 40GB)](#training-performance-nvidia-dgx-a100-8x-a100-40gb)
* [Training performance: NVIDIA DGX A100 (8x A100 80GB)](#training-performance-nvidia-dgx-a100-8x-a100-80gb)
* [Training performance: NVIDIA DGX-1 16GB (8x V100 16GB)](#training-performance-nvidia-dgx-1-16gb-8x-v100-16gb)
* [Training performance: NVIDIA DGX-1 32GB (8x V100 32GB)](#training-performance-nvidia-dgx-1-32gb-8x-v100-32gb)
* [Inference performance results](#inference-performance-results)
@ -45,7 +45,6 @@ achieve state-of-the-art accuracy, and is tested and maintained by NVIDIA.
* [Changelog](#changelog)
* [Known issues](#known-issues)
## Model overview
The SE-ResNeXt101-32x4d is a [ResNeXt101-32x4d](https://arxiv.org/pdf/1611.05431.pdf)
@ -106,13 +105,14 @@ This model uses the following data augmentation:
* Scale to 256x256
* Center crop to 224x224
### Feature support matrix
The following features are supported by this model:
| Feature | ResNeXt101-32x4d
| Feature | SE-ResNeXt101-32x4d
|-----------------------|--------------------------
|[DALI](https://docs.nvidia.com/deeplearning/dali/release-notes/index.html) | Yes
|[DALI](https://docs.nvidia.com/deeplearning/sdk/dali-release-notes/index.html) | Yes
|[APEX AMP](https://nvidia.github.io/apex/amp.html) | Yes |
#### Features
@ -129,11 +129,11 @@ which speeds up data loading when CPU becomes a bottleneck.
DALI can use CPU or GPU, and outperforms the PyTorch native dataloader.
Run training with `--data-backend dali-gpu` or `--data-backend dali-cpu` to enable DALI.
For ResNeXt101-32x4d, for DGX1 and DGX2 we recommend `--data-backends dali-cpu`.
For DGXA100 and DGX1 we recommend `--data-backend dali-cpu`.
### Mixed precision training
Mixed precision is the combined use of different numerical precisions in a computational method. [Mixed precision](https://arxiv.org/abs/1710.03740) training offers significant computational speedup by performing operations in half-precision format, while storing minimal information in single-precision to retain as much information as possible in critical parts of the network. Since the introduction of [Tensor Cores](https://developer.nvidia.com/tensor-cores) in the Volta and Turing architecture, significant training speedups are experienced by switching to mixed precision -- up to 3x overall speedup on the most arithmetically intense model architectures. Using mixed precision training requires two steps:
Mixed precision is the combined use of different numerical precisions in a computational method. [Mixed precision](https://arxiv.org/abs/1710.03740) training offers significant computational speedup by performing operations in half-precision format, while storing minimal information in single-precision to retain as much information as possible in critical parts of the network. Since the introduction of [Tensor Cores](https://developer.nvidia.com/tensor-cores) in Volta, and following with both the Turing and Ampere architectures, significant training speedups are experienced by switching to mixed precision -- up to 3x overall speedup on the most arithmetically intense model architectures. Using mixed precision training requires two steps:
1. Porting the model to use the FP16 data type where appropriate.
2. Adding loss scaling to preserve small gradient values.
@ -191,7 +191,7 @@ The following section lists the requirements that you need to meet in order to s
This repository contains Dockerfile which extends the PyTorch NGC container and encapsulates some dependencies. Aside from these dependencies, ensure you have the following components:
* [NVIDIA Docker](https://github.com/NVIDIA/nvidia-docker)
* [PyTorch 20.06-py3 NGC container](https://ngc.nvidia.com/registry/nvidia-pytorch) or newer
* [PyTorch 20.12-py3 NGC container](https://ngc.nvidia.com/registry/nvidia-pytorch) or newer
* Supported GPUs:
* [NVIDIA Volta architecture](https://www.nvidia.com/en-us/data-center/volta-gpu-architecture/)
* [NVIDIA Turing architecture](https://www.nvidia.com/en-us/geforce/turing/)
@ -216,7 +216,7 @@ cd DeepLearningExamples/PyTorch/Classification/
### 2. Download and preprocess the dataset.
The ResNeXt101-32x4d script operates on ImageNet 1k, a widely popular image classification dataset from the ILSVRC challenge.
The SE-ResNeXt101-32x4d script operates on ImageNet 1k, a widely popular image classification dataset from the ILSVRC challenge.
PyTorch can work directly on JPEGs, therefore, preprocessing/augmentation is not needed.
@ -243,27 +243,28 @@ For the specifics concerning training and inference, see the [Advanced](#advance
The directory in which the `train/` and `val/` directories are placed, is referred to as `<path to imagenet>` in this document.
### 3. Build the SE-RNXT101-32x4d PyTorch NGC container.
### 3. Build the SE-ResNeXt101-32x4d PyTorch NGC container.
```
docker build . -t nvidia_se-rnxt101-32x4d
docker build . -t nvidia_se-resnext101-32x4d
```
### 4. Start an interactive session in the NGC container to run training/inference.
```
nvidia-docker run --rm -it -v <path to imagenet>:/imagenet --ipc=host nvidia_se-rnxt101-32x4d
nvidia-docker run --rm -it -v <path to imagenet>:/imagenet --ipc=host nvidia_se-resnext101-32x4d
```
### 5. Start training
To run training for a standard configuration (DGXA100/DGX1/DGX2, AMP/TF32/FP32, 90/250 Epochs),
To run training for a standard configuration (DGXA100/DGX1V, AMP/TF32/FP32, 90/250 Epochs),
run one of the scripts in the `./se-resnext101-32x4d/training` directory
called `./se-resnext101-32x4d/training/{AMP, TF32, FP32}/{DGXA100, DGX1, DGX2}_SE-RNXT101-32x4d_{AMP, TF32, FP32}_{90,250}E.sh`.
called `./se-resnext101-32x4d/training/{AMP, TF32, FP32}/{DGXA100, DGX1V}_se-resnext101-32x4d_{AMP, TF32, FP32}_{90, 250}E.sh`.
Ensure ImageNet is mounted in the `/imagenet` directory.
Example:
`bash ./se-resnext101-32x4d/training/AMP/DGX1_SE-RNXT101-32x4d_AMP_250E.sh <path were to store checkpoints and logs>`
`bash ./se-resnext101-32x4d/training/AMP/DGX1_se-resnext101-32x4d_AMP_250E.sh <path where to store checkpoints and logs>`
### 6. Start inference
@ -281,7 +282,7 @@ To run inference on ImageNet, run:
To run inference on JPEG image using pretrained weights:
`python classify.py --arch se-resnext101-32x4d -c fanin --weights nvidia_se-resnext101-32x4d_200821.pth.tar --precision AMP|FP32 --image <path to JPEG image>`
`python classify.py --arch se-resnext101-32x4d -c fanin --weights nvidia_se-resnext101-32x4d_200821.pth.tar --precision AMP|FP32 --image <path to JPEG image>`
## Advanced
@ -320,7 +321,7 @@ usage: main.py [-h] [--data-backend BACKEND] [--arch ARCH]
[--lr-schedule SCHEDULE] [--warmup E] [--label-smoothing S]
[--mixup ALPHA] [--momentum M] [--weight-decay W]
[--bn-weight-decay] [--nesterov] [--print-freq N]
[--resume PATH] [--pretrained-weights PATH] [--fp16]
[--resume PATH] [--pretrained-weights PATH]
[--static-loss-scale STATIC_LOSS_SCALE] [--dynamic-loss-scale]
[--prof N] [--amp] [--seed SEED] [--gather-checkpoints]
[--raport-file RAPORT_FILE] [--evaluate] [--training-only]
@ -339,8 +340,10 @@ optional arguments:
data backend: pytorch | syntetic | dali-gpu | dali-cpu
(default: dali-cpu)
--arch ARCH, -a ARCH model architecture: resnet18 | resnet34 | resnet50 |
resnet101 | resnet152 | resnext101-32x4d | se-
resnext101-32x4d (default: resnet50)
resnet101 | resnet152 | resnext50-32x4d |
resnext101-32x4d | resnext101-32x8d |
resnext101-32x8d-basic | se-resnext101-32x4d (default:
resnet50)
--model-config CONF, -c CONF
model configs: classic | fanin | grp-fanin | grp-
fanout(default: classic)
@ -369,10 +372,9 @@ optional arguments:
--resume PATH path to latest checkpoint (default: none)
--pretrained-weights PATH
load weights from here
--fp16 Run model fp16 mode.
--static-loss-scale STATIC_LOSS_SCALE
Static loss scale, positive power of 2 values can
improve fp16 convergence.
improve amp convergence.
--dynamic-loss-scale Use dynamic loss scaling. If supplied, this argument
supersedes --static-loss-scale.
--prof N Run only N iterations
@ -390,6 +392,7 @@ optional arguments:
--workspace DIR path to directory where checkpoints will be stored
--memory-format {nchw,nhwc}
memory layout, nchw or nhwc
```
@ -400,25 +403,7 @@ To use your own dataset, divide it in directories as in the following scheme:
- Training images - `train/<class id>/<image>`
- Validation images - `val/<class id>/<image>`
If your dataset's has number of classes different than 1000, you need to add a custom config
in the `image_classification/resnet.py` file.
```python
resnet_versions = {
...
'se-resnext101-32x4d-custom' : {
'net' : ResNet,
'block' : SEBottleneck,
'cardinality' : 32,
'layers' : [3, 4, 23, 3],
'widths' : [128, 256, 512, 1024],
'expansion' : 2,
'num_classes' : <custom number of classes>,
}
}
```
After adding the config, run the training script with `--arch resnext101-32x4d-custom` flag.
If your dataset has a number of classes different from 1000, you need to pass the `--num-classes N` flag to the training script.
### Training process
@ -441,7 +426,7 @@ To restart training from checkpoint use `--resume` option.
To start training from pretrained weights (e.g. downloaded from NGC), use the `--pretrained-weights` option.
The difference between those two is that the pretrained weights contain only model weights,
and checkpoints, apart from model weights, contain optimizer state, LR scheduler state, RNG state.
and checkpoints, apart from model weights, also contain the optimizer state and LR scheduler state.
Checkpoints are suitable for dividing the training into parts, for example in order
to divide the training job into shorter stages, or to restart training after an infrastructure failure.
@ -487,14 +472,13 @@ wget --content-disposition https://api.ngc.nvidia.com/v2/models/nvidia/seresnext
unzip seresnext101_32x4d_pyt_amp_20.06.0.zip
```
To run inference on ImageNet, run:
`python ./main.py --arch se-resnext101-32x4d --evaluate --epochs 1 --pretrained-weights nvidia_se-resnext101-32x4d_200821.pth.tar -b <batch size> <path to imagenet>`
To run inference on JPEG image using pretrained weights:
`python classify.py --arch se-resnext101-32x4d -c fanin --weights nvidia_se-resnext101-32x4d_200821.pth.tar --precision AMP|FP32 --image <path to JPEG image>`
`python classify.py --arch se-resnext101-32x4d --weights nvidia_se-resnext101-32x4d_200821.pth.tar --precision AMP|FP32 --image <path to JPEG image>`
## Performance
@ -508,71 +492,62 @@ The following section shows how to run benchmarks measuring the model performanc
To benchmark training, run:
* For 1 GPU
* FP32
`python ./main.py --arch se-resnext101-32x4d -b <batch_size> --training-only -p 1 --raport-file benchmark.json --epochs 1 --prof 100 <path to imagenet>`
* FP32 (V100 GPUs only)
`python ./launch.py --model se-resnext101-32x4d --precision FP32 --mode benchmark_training --platform DGX1V <path to imagenet> --raport-file benchmark.json --epochs 1 --prof 100`
* TF32 (A100 GPUs only)
`python ./launch.py --model se-resnext101-32x4d --precision TF32 --mode benchmark_training --platform DGXA100 <path to imagenet> --raport-file benchmark.json --epochs 1 --prof 100`
* AMP
`python ./main.py --arch se-resnext101-32x4d -b <batch_size> --training-only -p 1 --raport-file benchmark.json --epochs 1 --prof 100 --amp --static-loss-scale 256 --memory-format nhwc <path to imagenet>`
`python ./launch.py --model se-resnext101-32x4d --precision AMP --mode benchmark_training --platform <DGX1V|DGXA100> <path to imagenet> --raport-file benchmark.json --epochs 1 --prof 100`
* For multiple GPUs
* FP32
`python ./multiproc.py --nproc_per_node 8 ./main.py --arch se-resnext101-32x4d -b <batch_size> --training-only -p 1 --raport-file benchmark.json --epochs 1 --prof 100 <path to imagenet>`
* FP32 (V100 GPUs only)
`python ./launch.py --model se-resnext101-32x4d --precision FP32 --mode benchmark_training --platform DGX1V <path to imagenet> --raport-file benchmark.json --epochs 1 --prof 100`
* TF32 (A100 GPUs only)
`python ./multiproc.py --nproc_per_node 8 ./launch.py --model se-resnext101-32x4d --precision TF32 --mode benchmark_training --platform DGXA100 <path to imagenet> --raport-file benchmark.json --epochs 1 --prof 100`
* AMP
`python ./multiproc.py --nproc_per_node 8 ./main.py --arch se-resnext101-32x4d -b <batch_size> --training-only -p 1 --raport-file benchmark.json --amp --static-loss-scale 256 --memory-format nhwc --epochs 1 --prof 100 <path to imagenet>`
`python ./multiproc.py --nproc_per_node 8 ./launch.py --model se-resnext101-32x4d --precision AMP --mode benchmark_training --platform <DGX1V|DGXA100> <path to imagenet> --raport-file benchmark.json --epochs 1 --prof 100`
Each of these scripts will run 100 iterations and save results in the `benchmark.json` file.
Batch size should be picked appropriately depending on the hardware configuration.
| *Platform* | *Precision* | *Batch Size* |
|:----------:|:-----------:|:------------:|
| DGXA100 | AMP | 128 |
| DGXA100 | TF32 | 128 |
| DGX-1 | AMP | 128 |
| DGX-1 | FP32 | 64 |
#### Inference performance benchmark
To benchmark inference, run:
* FP32
* FP32 (V100 GPUs only)
`python ./main.py --arch se-resnext101-32x4d -b <batch_size> -p 1 --raport-file benchmark.json --epochs 1 --prof 100 --evaluate <path to imagenet>`
`python ./launch.py --model se-resnext101-32x4d --precision FP32 --mode benchmark_inference --platform DGX1V <path to imagenet> --raport-file benchmark.json --epochs 1 --prof 100`
* TF32 (A100 GPUs only)
`python ./launch.py --model se-resnext101-32x4d --precision FP32 --mode benchmark_inference --platform DGXA100 <path to imagenet> --raport-file benchmark.json --epochs 1 --prof 100`
* AMP
`python ./main.py --arch se-resnext101-32x4d -b <batch_size> -p 1 --raport-file benchmark.json --epochs 1 --prof 100 --evaluate --amp --memory-format nhwc <path to imagenet>`
`python ./launch.py --model se-resnext101-32x4d --precision AMP --mode benchmark_inference --platform <DGX1V|DGXA100> <path to imagenet> --raport-file benchmark.json --epochs 1 --prof 100`
Each of these scripts will run 100 iterations and save results in the `benchmark.json` file.
Batch size should be picked appropriately depending on the hardware configuration.
| *Platform* | *Precision* | *Batch Size* |
|:----------:|:-----------:|:------------:|
| DGXA100 | AMP | 128 |
| DGXA100 | TF32 | 128 |
| DGX-1 | AMP | 128 |
| DGX-1 | FP32 | 64 |
### Results
Our results were obtained by running the applicable training script in the pytorch-20.06 NGC container.
Our results were obtained by running the applicable training script in the pytorch-20.12 NGC container.
To achieve these same results, follow the steps in the [Quick Start Guide](#quick-start-guide).
#### Training accuracy results
##### Training accuracy: NVIDIA DGX A100 (8x A100 40GB)
##### Training accuracy: NVIDIA DGX A100 (8x A100 80GB)
| **Epochs** | **Mixed Precision Top1** | **TF32 Top1** |
|:----------:|:------------------------:|:--------------:|
| 90 | 80.03 +/- 0.11 | 79.92 +/- 0.07 |
| 250 | 80.9 +/- 0.08 | 80.98 +/- 0.07 |
| **epochs** | **Mixed Precision Top1** | **TF32 Top1** |
|:------:|:--------------------:|:--------------:|
| 90 | 79.95 +/- 0.09 | 79.97 +/- 0.08 |
##### Training accuracy: NVIDIA DGX-1 (8x V100 16GB)
| **epochs** | **Mixed Precision Top1** | **FP32 Top1** |
|:-:|:-:|:-:|
| 90 | 80.04 +/- 0.10 | 79.93 +/- 0.10 |
| 250 | 80.96 +/- 0.04 | 80.97 +/- 0.09 |
| **Epochs** | **Mixed Precision Top1** | **FP32 Top1** |
|:----------:|:------------------------:|:--------------:|
| 90 | 80.04 +/- 0.07 | 79.93 +/- 0.10 |
| 250 | 80.92 +/- 0.09 | 80.97 +/- 0.09 |
##### Example plots
@ -587,26 +562,29 @@ The following images show a 250 epochs configuration on a DGX-1V.
#### Training performance results
##### Training performance: NVIDIA DGX A100 (8x A100 40GB)
##### Training performance: NVIDIA DGX A100 (8x A100 80GB)
| **GPUs** | **Mixed Precision** | **TF32** | **Mixed Precision Speedup** | **Mixed Precision Strong Scaling** | **Mixed Precision Training Time (90E)** | **TF32 Strong Scaling** | **TF32 Training Time (90E)** |
|:--------:|:-------------------:|:----------:|:---------------------------:|:----------------------------------:|:---------------------------------------:|:-----------------------:|:----------------------------:|
| 1 | 804 img/s | 360 img/s | 2.22 x | 1.0 x | ~42 hours | 1.0 x | ~94 hours |
| 8 | 5248 img/s | 2665 img/s | 1.96 x | 6.52 x | ~7 hours | 7.38 x | ~13 hours |
|**GPUs**|**Mixed Precision**| **TF32** |**Mixed Precision Speedup**|**Mixed Precision Strong Scaling**|**Mixed Precision Training Time (90E)**|**TF32 Strong Scaling**|**TF32 Training Time (90E)**|
|:------:|:-----------------:|:-----------:|:-------------------------:|:--------------------------------:|:-------------------------------------:|:---------------------:|:--------------------------:|
| 1 | 641.57 img/s |258.75 img/s | 2.48x | 1.00x | ~52 hours | 1.00x | ~129 hours |
| 8 | 4758.40 img/s |2038.03 img/s| 2.33x | 7.42x | ~7 hours | 7.88x | ~17 hours |
##### Training performance: NVIDIA DGX-1 16GB (8x V100 16GB)
|**GPUs**|**Mixed Precision**| **FP32** |**Mixed Precision Speedup**|**Mixed Precision Strong Scaling**|**Mixed Precision Training Time (90E)**|**FP32 Strong Scaling**|**FP32 Training Time (90E)**|
|:------:|:-----------------:|:----------:|:-------------------------:|:--------------------------------:|:-------------------------------------:|:---------------------:|:--------------------------:|
| 1 | 383.15 img/s |130.48 img/s| 2.94x | 1.00x | ~87 hours | 1.00x | ~255 hours |
| 8 | 2695.10 img/s |996.04 img/s| 2.71x | 7.03x | ~13 hours | 7.63x | ~34 hours |
| **GPUs** | **Mixed Precision** | **FP32** | **Mixed Precision Speedup** | **Mixed Precision Strong Scaling** | **Mixed Precision Training Time (90E)** | **FP32 Strong Scaling** | **FP32 Training Time (90E)** |
|:--------:|:-------------------:|:---------:|:---------------------------:|:----------------------------------:|:---------------------------------------:|:-----------------------:|:----------------------------:|
| 1 | 430 img/s | 133 img/s | 3.21 x | 1.0 x | ~79 hours | 1.0 x | ~252 hours |
| 8 | 2716 img/s | 994 img/s | 2.73 x | 6.31 x | ~13 hours | 7.42 x | ~34 hours |
##### Training performance: NVIDIA DGX-1 32GB (8x V100 32GB)
|**GPUs**|**Mixed Precision**| **FP32** |**Mixed Precision Speedup**|**Mixed Precision Strong Scaling**|**Mixed Precision Training Time (90E)**|**FP32 Strong Scaling**|**FP32 Training Time (90E)**|
|:------:|:-----------------:|:----------:|:-------------------------:|:--------------------------------:|:-------------------------------------:|:---------------------:|:--------------------------:|
| 1 | 364.65 img/s |123.46 img/s| 2.95x | 1.00x | ~92 hours | 1.00x | ~270 hours |
| 8 | 2540.49 img/s |959.94 img/s| 2.65x | 6.97x | ~13 hours | 7.78x | ~35 hours |
| **GPUs** | **Mixed Precision** | **FP32** | **Mixed Precision Speedup** | **Mixed Precision Strong Scaling** | **Mixed Precision Training Time (90E)** | **FP32 Strong Scaling** | **FP32 Training Time (90E)** |
|:--------:|:-------------------:|:----------:|:---------------------------:|:----------------------------------:|:---------------------------------------:|:-----------------------:|:----------------------------:|
| 1 | 413 img/s | 134 img/s | 3.08 x | 1.0 x | ~82 hours | 1.0 x | ~251 hours |
| 8 | 2572 img/s | 1011 img/s | 2.54 x | 6.22 x | ~14 hours | 7.54 x | ~34 hours |
#### Inference performance results
@ -614,62 +592,65 @@ The following images show a 250 epochs configuration on a DGX-1V.
###### FP32 Inference Latency
| **batch size** | **Throughput Avg** | **Latency Avg** | **Latency 90%** | **Latency 95%** | **Latency 99%** |
|:-:|:-:|:-:|:-:|:-:|:-:|
| 1 | 33.58 img/s | 29.72ms | 30.92ms | 31.77ms | 34.65ms |
| 2 | 66.47 img/s | 29.94ms | 31.30ms | 32.74ms | 34.79ms |
| 4 | 135.31 img/s | 29.36ms | 29.78ms | 32.61ms | 33.90ms |
| 8 | 261.52 img/s | 30.42ms | 32.73ms | 33.99ms | 35.61ms |
| 16 | 356.05 img/s | 44.61ms | 44.93ms | 45.17ms | 46.90ms |
| 32 | 391.83 img/s | 80.91ms | 81.28ms | 81.64ms | 82.69ms |
| 64 | 443.91 img/s | 142.70ms | 142.99ms | 143.46ms | 145.01ms |
| 128 | N/A | N/A | N/A | N/A | N/A |
| **Batch Size** | **Throughput Avg** | **Latency Avg** | **Latency 95%** | **Latency 99%** |
|:--------------:|:------------------:|:---------------:|:---------------:|:---------------:|
| 1 | 37 img/s | 26.81 ms | 27.89 ms | 31.44 ms |
| 2 | 75 img/s | 27.01 ms | 28.89 ms | 31.17 ms |
| 4 | 144 img/s | 28.09 ms | 30.14 ms | 32.47 ms |
| 8 | 259 img/s | 31.23 ms | 33.65 ms | 38.4 ms |
| 16 | 332 img/s | 48.7 ms | 48.35 ms | 48.8 ms |
| 32 | 394 img/s | 83.02 ms | 81.55 ms | 81.9 ms |
| 64 | 471 img/s | 138.88 ms | 136.24 ms | 136.54 ms |
| 128 | 505 img/s | 261.4 ms | 253.07 ms | 254.29 ms |
| 256 | 513 img/s | 516.66 ms | 496.06 ms | 497.05 ms |
###### Mixed Precision Inference Latency
| **batch size** | **Throughput Avg** | **Latency Avg** | **Latency 90%** | **Latency 95%** | **Latency 99%** |
|:-:|:-:|:-:|:-:|:-:|:-:|
| 1 | 35.08 img/s | 28.40ms | 29.75ms | 31.77ms | 35.85ms |
| 2 | 68.85 img/s | 28.92ms | 30.24ms | 31.46ms | 37.07ms |
| 4 | 131.78 img/s | 30.17ms | 31.39ms | 32.66ms | 37.17ms |
| 8 | 260.21 img/s | 30.52ms | 31.20ms | 32.92ms | 34.46ms |
| 16 | 506.62 img/s | 31.36ms | 32.48ms | 34.13ms | 36.49ms |
| 32 | 778.92 img/s | 40.69ms | 40.90ms | 41.07ms | 43.67ms |
| 64 | 880.49 img/s | 72.10ms | 72.29ms | 72.34ms | 76.46ms |
| 128 | 977.86 img/s | 130.19ms | 130.34ms | 130.41ms | 131.12ms |
| **Batch Size** | **Throughput Avg** | **Latency Avg** | **Latency 95%** | **Latency 99%** |
|:--------------:|:------------------:|:---------------:|:---------------:|:---------------:|
| 1 | 29 img/s | 34.24 ms | 36.67 ms | 39.4 ms |
| 2 | 53 img/s | 37.81 ms | 43.03 ms | 45.1 ms |
| 4 | 103 img/s | 39.1 ms | 43.05 ms | 46.16 ms |
| 8 | 226 img/s | 35.66 ms | 38.39 ms | 41.13 ms |
| 16 | 458 img/s | 35.4 ms | 37.38 ms | 39.97 ms |
| 32 | 882 img/s | 37.37 ms | 40.12 ms | 42.64 ms |
| 64 | 1356 img/s | 49.31 ms | 47.21 ms | 49.87 ms |
| 112 | 1448 img/s | 81.27 ms | 77.35 ms | 78.28 ms |
| 128 | 1486 img/s | 90.59 ms | 86.15 ms | 87.04 ms |
| 256 | 1534 img/s | 176.72 ms | 166.2 ms | 167.53 ms |
##### Inference performance: NVIDIA T4
###### FP32 Inference Latency
| **batch size** | **Throughput Avg** | **Latency Avg** | **Latency 90%** | **Latency 95%** | **Latency 99%** |
|:-:|:-:|:-:|:-:|:-:|:-:|
| 1 | 40.47 img/s | 24.72ms | 26.94ms | 29.33ms | 33.03ms |
| 2 | 84.16 img/s | 23.66ms | 24.53ms | 25.96ms | 29.42ms |
| 4 | 165.10 img/s | 24.08ms | 24.59ms | 25.75ms | 27.57ms |
| 8 | 266.04 img/s | 29.90ms | 30.51ms | 30.84ms | 33.07ms |
| 16 | 325.89 img/s | 48.57ms | 48.91ms | 49.02ms | 51.01ms |
| 32 | 365.99 img/s | 86.94ms | 87.15ms | 87.41ms | 90.74ms |
| 64 | 410.43 img/s | 155.30ms | 156.07ms | 156.36ms | 164.74ms |
| 128 | N/A | N/A | N/A | N/A | N/A |
| **Batch Size** | **Throughput Avg** | **Latency Avg** | **Latency 95%** | **Latency 99%** |
|:--------------:|:------------------:|:---------------:|:---------------:|:---------------:|
| 1 | 52 img/s | 19.39 ms | 20.39 ms | 21.18 ms |
| 2 | 102 img/s | 19.98 ms | 21.4 ms | 23.75 ms |
| 4 | 134 img/s | 30.12 ms | 30.14 ms | 30.54 ms |
| 8 | 136 img/s | 59.07 ms | 60.63 ms | 61.49 ms |
| 16 | 154 img/s | 104.38 ms | 105.21 ms | 105.81 ms |
| 32 | 169 img/s | 190.12 ms | 189.64 ms | 190.24 ms |
| 64 | 171 img/s | 376.19 ms | 374.16 ms | 375.6 ms |
| 128 | 168 img/s | 771.4 ms | 761.64 ms | 764.7 ms |
| 256 | 159 img/s | 1639.15 ms | 1603.45 ms | 1605.47 ms |
###### Mixed Precision Inference Latency
| **batch size** | **Throughput Avg** | **Latency Avg** | **Latency 90%** | **Latency 95%** | **Latency 99%** |
|:-:|:-:|:-:|:-:|:-:|:-:|
| 1 | 38.80 img/s | 25.74ms | 26.10ms | 29.28ms | 31.72ms |
| 2 | 78.79 img/s | 25.29ms | 25.83ms | 27.18ms | 33.07ms |
| 4 | 160.22 img/s | 24.81ms | 25.58ms | 26.25ms | 27.93ms |
| 8 | 298.01 img/s | 26.69ms | 27.59ms | 29.13ms | 32.69ms |
| 16 | 567.48 img/s | 28.05ms | 28.36ms | 31.28ms | 34.44ms |
| 32 | 709.56 img/s | 44.58ms | 44.69ms | 44.98ms | 47.99ms |
| 64 | 799.72 img/s | 79.32ms | 79.40ms | 79.49ms | 84.34ms |
| 128 | 856.19 img/s | 147.92ms | 149.02ms | 149.13ms | 151.90ms |
| **Batch Size** | **Throughput Avg** | **Latency Avg** | **Latency 95%** | **Latency 99%** |
|:--------------:|:------------------:|:---------------:|:---------------:|:---------------:|
| 1 | 42 img/s | 24.17 ms | 27.26 ms | 29.98 ms |
| 2 | 87 img/s | 23.24 ms | 24.66 ms | 26.77 ms |
| 4 | 170 img/s | 23.87 ms | 24.89 ms | 29.59 ms |
| 8 | 334 img/s | 24.49 ms | 27.92 ms | 35.66 ms |
| 16 | 472 img/s | 34.45 ms | 34.29 ms | 35.72 ms |
| 32 | 502 img/s | 64.93 ms | 64.47 ms | 65.16 ms |
| 64 | 517 img/s | 126.24 ms | 125.03 ms | 125.86 ms |
| 128 | 522 img/s | 250.99 ms | 245.87 ms | 247.1 ms |
| 256 | 523 img/s | 502.41 ms | 487.58 ms | 489.69 ms |
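The latency statistics in these tables are derived from per-batch timings. A minimal sketch of such a measurement (an illustration under assumptions, not the repository's inference benchmark), given a hypothetical `run_inference(batch)` callable and a list of equally sized batches:

```
import time
import numpy as np

def benchmark_inference(run_inference, batches, warmup=10):
    # Time each batch; discard the first `warmup` iterations, which include
    # framework initialization and cache/autotuning warmup.
    timings = []
    for i, batch in enumerate(batches):
        start = time.perf_counter()
        run_inference(batch)
        if i >= warmup:
            timings.append(time.perf_counter() - start)
    t = np.array(timings)
    batch_size = len(batches[0])
    return {
        "throughput_avg_img_s": batch_size / t.mean(),
        "latency_avg_ms": 1e3 * t.mean(),
        "latency_p95_ms": 1e3 * np.percentile(t, 95),
        "latency_p99_ms": 1e3 * np.percentile(t, 99),
    }
```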
## Release notes
@ -681,9 +662,10 @@ The following images show a 250 epochs configuration on a DGX-1V.
2. July 2020
* Added A100 scripts
* Updated README
3. February 2021
* Moved from APEX AMP to Native AMP
### Known issues
There are no known issues with this model.

View file

@ -0,0 +1 @@
python ./launch.py --model se-resnext101-32x4d --precision AMP --mode convergence --platform DGX1V /imagenet --workspace ${1:-./} --raport-file raport.json

View file

@ -0,0 +1 @@
python ./launch.py --model se-resnext101-32x4d --precision AMP --mode convergence --platform DGX1V /imagenet --epochs 90 --mixup 0.0 --workspace ${1:-./} --raport-file raport.json

View file

@ -1 +0,0 @@
python ./multiproc.py --nproc_per_node 8 ./main.py /imagenet --raport-file raport.json -j8 -p 100 --data-backend pytorch --arch se-resnext101-32x4d -c fanin --label-smoothing 0.1 --workspace $1 -b 128 --amp --static-loss-scale 128 --optimizer-batch-size 1024 --lr 1.024 --mom 0.875 --lr-schedule cosine --epochs 250 --warmup 8 --wd 6.103515625e-05 --mixup 0.2 --memory-format nhwc

View file

@ -1 +0,0 @@
python ./multiproc.py --nproc_per_node 8 ./main.py /imagenet --raport-file raport.json -j8 -p 100 --data-backend pytorch --arch se-resnext101-32x4d -c fanin --label-smoothing 0.1 --workspace $1 -b 128 --amp --static-loss-scale 128 --optimizer-batch-size 1024 --lr 1.024 --mom 0.875 --lr-schedule cosine --epochs 90 --warmup 8 --wd 6.103515625e-05 --memory-format nhwc

View file

@ -1 +0,0 @@
python ./multiproc.py --nproc_per_node 8 ./main.py /imagenet --raport-file raport.json -j16 -p 100 --data-backend pytorch --arch se-resnext101-32x4d -c fanin --label-smoothing 0.1 --workspace $1 -b 128 --amp --static-loss-scale 128 --optimizer-batch-size 1024 --lr 1.024 --mom 0.875 --lr-schedule cosine --epochs 90 --warmup 8 --wd 6.103515625e-05 --memory-format nhwc

View file

@ -0,0 +1 @@
python ./launch.py --model se-resnext101-32x4d --precision AMP --mode convergence --platform DGXA100 /imagenet --workspace ${1:-./} --raport-file raport.json

View file

@ -0,0 +1 @@
python ./launch.py --model se-resnext101-32x4d --precision AMP --mode convergence --platform DGXA100 /imagenet --epochs 90 --mixup 0.0 --workspace ${1:-./} --raport-file raport.json

View file

@ -0,0 +1 @@
python ./launch.py --model se-resnext101-32x4d --precision FP32 --mode convergence --platform DGX1V /imagenet --workspace ${1:-./} --raport-file raport.json

View file

@ -0,0 +1 @@
python ./launch.py --model se-resnext101-32x4d --precision FP32 --mode convergence --platform DGX1V /imagenet --epochs 90 --mixup 0.0 --workspace ${1:-./} --raport-file raport.json

View file

@ -1 +0,0 @@
python ./multiproc.py --nproc_per_node 8 ./main.py /imagenet --raport-file raport.json -j8 -p 100 --data-backend pytorch --arch se-resnext101-32x4d -c fanin --label-smoothing 0.1 --workspace $1 -b 64 --optimizer-batch-size 1024 --lr 1.024 --mom 0.875 --lr-schedule cosine --epochs 250 --warmup 8 --wd 6.103515625e-05 --mixup 0.2

View file

@ -1 +0,0 @@
python ./multiproc.py --nproc_per_node 8 ./main.py /imagenet --raport-file raport.json -j8 -p 100 --data-backend pytorch --arch se-resnext101-32x4d -c fanin --label-smoothing 0.1 --workspace $1 -b 64 --optimizer-batch-size 1024 --lr 1.024 --mom 0.875 --lr-schedule cosine --epochs 90 --warmup 8 --wd 6.103515625e-05

View file

@ -1 +0,0 @@
python ./multiproc.py --nproc_per_node 8 ./main.py /imagenet --raport-file raport.json -j16 -p 100 --data-backend pytorch --arch se-resnext101-32x4d -c fanin --label-smoothing 0.1 --workspace $1 -b 128 --optimizer-batch-size 1024 --lr 1.024 --mom 0.875 --lr-schedule cosine --epochs 90 --warmup 8 --wd 6.103515625e-05

View file

@ -0,0 +1 @@
python ./launch.py --model se-resnext101-32x4d --precision TF32 --mode convergence --platform DGXA100 /imagenet --workspace ${1:-./} --raport-file raport.json

View file

@ -0,0 +1 @@
python ./launch.py --model se-resnext101-32x4d --precision TF32 --mode convergence --platform DGXA100 /imagenet --epochs 90 --mixup 0.0 --workspace ${1:-./} --raport-file raport.json

View file

@ -61,6 +61,7 @@ def initialize_model(args):
model.load_state_dict(
{k.replace("module.", ""): v for k, v in state_dict.items()}
)
model.load_state_dict(state_dict)
return model.half() if args.fp16 else model

View file

@ -0,0 +1,23 @@
# Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
.idea
**/.ipynb_checkpoints
**/__pycache__
**/.gitkeep
.git
.gitignore
Dockerfile*
.dockerignore
README.md

View file

@ -0,0 +1,20 @@
# Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
.idea/
*.tar
.ipynb_checkpoints
/_python_build
*.pyc
__pycache__

View file

@ -0,0 +1,54 @@
# Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
ARG FROM_IMAGE_NAME=nvcr.io/nvidia/nvtabular:0.3
FROM ${FROM_IMAGE_NAME}
USER root
# Spark dependencies
ENV APACHE_SPARK_VERSION 2.3.1
ENV HADOOP_VERSION 2.7
RUN apt-get -y update && \
apt-get install --no-install-recommends -y openjdk-8-jre-headless ca-certificates-java time && \
apt-get clean && \
rm -rf /var/lib/apt/lists/*
RUN cd /tmp && \
wget -q http://archive.apache.org/dist/spark/spark-${APACHE_SPARK_VERSION}/spark-${APACHE_SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz && \
echo "DC3A97F3D99791D363E4F70A622B84D6E313BD852F6FDBC777D31EAB44CBC112CEEAA20F7BF835492FB654F48AE57E9969F93D3B0E6EC92076D1C5E1B40B4696 *spark-${APACHE_SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz" | sha512sum -c - && \
tar xzf spark-${APACHE_SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz -C /usr/local --owner root --group root --no-same-owner && \
rm spark-${APACHE_SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz
RUN cd /usr/local && ln -s spark-${APACHE_SPARK_VERSION}-bin-hadoop${HADOOP_VERSION} spark
# Spark config
ENV SPARK_HOME /usr/local/spark
ENV PYTHONPATH $SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.10.7-src.zip:/wd
ENV SPARK_OPTS --driver-java-options=-Xms1024M --driver-java-options=-Xmx4096M --driver-java-options=-Dlog4j.logLevel=info
ENV PYSPARK_PYTHON /conda/envs/rapids/bin/python
ENV PYSPARK_DRIVER_PYTHON /conda/envs/rapids/bin/python
SHELL ["/bin/bash", "-c"]
RUN source activate rapids && \
pip install --upgrade pip && \
pip install --no-cache-dir pyspark==2.3.1 && \
pip install --no-cache-dir --no-deps tensorflow-transform==0.24.1 apache-beam==2.14 tensorflow-metadata==0.14.0 pydot dill && \
pip install --no-cache-dir -e git://github.com/NVIDIA/dllogger#egg=dllogger
WORKDIR /wd
COPY . .

View file

@ -0,0 +1,28 @@
# Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
ARG FROM_IMAGE_NAME=nvcr.io/nvidia/tensorflow:20.12-tf2-py3
FROM ${FROM_IMAGE_NAME}
USER root
RUN pip install --upgrade pip && \
pip install --no-cache-dir --no-deps tensorflow-transform==0.24.1 tensorflow-metadata==0.14.0 pydot dill && \
pip install --no-cache-dir ipdb pynvml==8.0.4 && \
pip install --no-cache-dir -e git://github.com/NVIDIA/dllogger#egg=dllogger
WORKDIR /wd
COPY . .

View file

@ -0,0 +1,833 @@
# Wide & Deep Recommender Model Training For TensorFlow 2
This repository provides a script and recipe to train the Wide and Deep Recommender model to achieve state-of-the-art accuracy.
The content of the repository is tested and maintained by NVIDIA.
- [Model overview](#model-overview)
* [Model architecture](#model-architecture)
* [Applications and dataset](#applications-and-dataset)
* [Default configuration](#default-configuration)
* [Model accuracy metric](#model-accuracy-metric)
* [Feature support matrix](#feature-support-matrix)
+ [Features](#features)
* [Mixed precision training](#mixed-precision-training)
+ [Enabling mixed precision](#enabling-mixed-precision)
+ [Enabling TF32](#enabling-tf32)
* [Glossary](#glossary)
- [Setup](#setup)
* [Requirements](#requirements)
- [Quick Start Guide](#quick-start-guide)
- [Advanced](#advanced)
* [Scripts and sample code](#scripts-and-sample-code)
* [Parameters](#parameters)
* [Command-line options](#command-line-options)
* [Getting the data](#getting-the-data)
+ [Dataset guidelines](#dataset-guidelines)
+ [Dataset preprocessing](#dataset-preprocessing)
- [Spark CPU Dataset preprocessing](#spark-cpu-dataset-preprocessing)
- [NVTabular GPU preprocessing](#nvtabular-gpu-preprocessing)
* [Training process](#training-process)
* [Evaluation process](#evaluation-process)
- [Performance](#performance)
* [Benchmarking](#benchmarking)
+ [NVTabular and Spark CPU Preprocessing comparison](#nvtabular-and-spark-cpu-preprocessing-comparison)
+ [Training and evaluation performance benchmark](#training-and-evaluation-performance-benchmark)
* [Results](#results)
+ [Training accuracy results](#training-accuracy-results)
- [Training accuracy: NVIDIA DGX A100 (8x A100 80GB)](#training-accuracy-nvidia-dgx-a100-8x-a100-80gb)
- [Training accuracy: NVIDIA DGX-1 (8x V100 16GB)](#training-accuracy-nvidia-dgx-1-8x-v100-16gb)
- [Training accuracy plots](#training-accuracy-plots)
- [Training stability test](#training-stability-test)
- [Impact of mixed precision on training accuracy](#impact-of-mixed-precision-on-training-accuracy)
+ [Training performance results](#training-performance-results)
- [Training performance: NVIDIA DGX A100 (8x A100 80GB)](#training-performance-nvidia-dgx-a100-8x-a100-80gb)
- [Training performance: NVIDIA DGX-1 (8x V100 16GB)](#training-performance-nvidia-dgx-1-8x-v100-16gb)
+ [Evaluation performance results](#evaluation-performance-results)
- [Evaluation performance: NVIDIA DGX A100 (8x A100 80GB)](#evaluation-performance-nvidia-dgx-a100-8x-a100-80gb)
- [Evaluation performance: NVIDIA DGX-1 (8x V100 16GB)](#evaluation-performance-nvidia-dgx-1-8x-v100-16gb)
- [Release notes](#release-notes)
* [Changelog](#changelog)
* [Known issues](#known-issues)
## Model overview
Recommendation systems drive engagement on many of the most popular online platforms. As the volume of data available to power these systems grows exponentially, Data Scientists are increasingly turning from more traditional machine learning methods to highly expressive deep learning models to improve the quality of their recommendations.
Google's [Wide & Deep Learning for Recommender Systems](https://arxiv.org/abs/1606.07792) has emerged as a popular model for Click Through Rate (CTR) prediction tasks thanks to its power of generalization (deep part) and memorization (wide part).
The difference between this Wide & Deep Recommender Model and the model from the paper is the size of the deep part of the model. In Google's paper, the fully connected part consists of three layers of 1024, 512, and 256 neurons; our model consists of 5 layers of 1024 neurons each.
This model is trained with mixed precision using Tensor Cores on NVIDIA Volta and NVIDIA Ampere GPU architectures. Therefore, researchers can get results 4.5 times faster than training without Tensor Cores, while experiencing the benefits of mixed precision training. This model is tested against each NGC monthly container release to ensure consistent accuracy and performance over time.
### Model architecture
Wide & Deep refers to a class of networks that use the output of two parts working in parallel - wide model and deep model - to make a binary prediction of CTR. The wide model is a linear model of features together with their transforms. The deep model is a series of 5 hidden MLP layers of 1024 neurons. The model can handle both numerical continuous features as well as categorical features represented as dense embeddings. The architecture of the model is presented in Figure 1.
<p align="center">
<img width="100%" src="./img/model.svg">
<br>
Figure 1. The architecture of the Wide & Deep model.</a>
</p>
### Applications and dataset
As a reference dataset, we used a subset of [the features engineered](https://github.com/gabrielspmoreira/kaggle_outbrain_click_prediction_google_cloud_ml_engine) by the 19th place finisher in the [Kaggle Outbrain Click Prediction Challenge](https://www.kaggle.com/c/outbrain-click-prediction/). This competition challenged competitors to predict the likelihood with which a particular ad on a website's display would be clicked on. Competitors were given information about the user, display, document, and ad in order to train their models. More information can be found [here](https://www.kaggle.com/c/outbrain-click-prediction/data).
### Default configuration
The Outbrain dataset is preprocessed to produce the features that are input to the model. To give context to the acceleration numbers described below, some important properties of our features and model are listed below.
Features:
- Request Level:
* 5 scalar numeric features `dtype=float32`
* 8 categorical features (all INT32 `dtype`)
* 8 trainable embeddings of (dimension, cardinality of categorical variable): (128,300000), (16,4), (128,100000), (64 ,4000), (64,1000), (64,2500), (64,300), (64,2000)
* 8 trainable embeddings for wide part of size 1 (serving as an embedding from the categorical to scalar space for input to the wide portion of the model)
- Item Level:
* 8 scalar numeric features `dtype=float32`
* 5 categorical features (all INT32 `dtype`)
* 5 trainable embeddings of dimensions (cardinality of categorical variable): 128 (250000), 64 (2500), 64 (4000), 64 (1000), 64 (5000)
* 5 trainable embeddings for wide part of size 1 (working as trainable one-hot embeddings)
Features describe both the user (Request Level features) and Item (Item Level Features).
- Model:
* Input dimension is 26 (13 categorical and 13 numerical features)
* Total embedding dimension is 976
* 5 hidden layers each with size 1024
* Total number of model parameters is ~90M
* Output dimension is 1 (`y` is the probability of click given Request-level and Item-level features)
* Loss function: Binary Crossentropy
For more information about feature preprocessing, go to [Dataset preprocessing](#dataset-preprocessing).
### Model accuracy metric
Model accuracy is defined with the [MAP@12](https://en.wikipedia.org/wiki/Evaluation_measures_(information_retrieval)#Mean_average_precision) metric. This metric follows the way model accuracy was assessed in the original [Kaggle Outbrain Click Prediction Challenge](https://www.kaggle.com/c/outbrain-click-prediction/). In this repository, the leaked clicked ads are not taken into account, since in an industrial setup data scientists do not have access to leaked information when training the model. For more information about the data leak in the Kaggle Outbrain Click Prediction challenge, see this [blogpost](https://medium.com/unstructured/how-feature-engineering-can-help-you-do-well-in-a-kaggle-competition-part-ii-3645d92282b8) by the 19th place finisher in that competition.
The training and evaluation script also reports AUC ROC, binary accuracy, and loss (BCE) values.
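Because each `display_id` contains exactly one clicked ad, MAP@12 reduces to the mean reciprocal rank of the clicked ad, truncated at 12. A minimal sketch of the metric (not the repository's implementation):

```
import numpy as np

def map_at_12(scores_per_display, clicked_index_per_display):
    # scores_per_display: list of 1-D arrays of model scores, one array per display_id
    # clicked_index_per_display: index of the clicked ad inside each score array
    average_precisions = []
    for scores, clicked in zip(scores_per_display, clicked_index_per_display):
        top12 = np.argsort(-scores)[:12]              # ads ranked by predicted score
        hit = np.where(top12 == clicked)[0]
        average_precisions.append(1.0 / (hit[0] + 1) if hit.size else 0.0)
    return float(np.mean(average_precisions))
```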
### Feature support matrix
The following features are supported by this model:
| Feature | Wide & Deep |
| -------------------------------- | ----------- |
| Horovod Multi-GPU (NCCL) | Yes |
| Accelerated Linear Algebra (XLA) | Yes |
| Automatic mixed precision (AMP) | Yes |
#### Features
**Horovod** is a distributed training framework for TensorFlow, Keras, PyTorch and MXNet. The goal of Horovod is to make distributed deep learning fast and easy to use. For more information about how to get started with Horovod, see: [Horovod: Official repository](https://github.com/horovod/horovod).
**Multi-GPU training with Horovod**
Our model uses Horovod to implement efficient multi-GPU training with NCCL. For details, see example sources in this repository or see: [TensorFlow tutorial](https://github.com/horovod/horovod/#usage).
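A minimal Horovod + Keras sketch of this setup (an illustration under assumptions, not the repository's training loop): one process is started per GPU, the optimizer is wrapped in `DistributedOptimizer`, initial variables are broadcast from rank 0, and the per-GPU batch is the global batch divided by the number of workers (strong scaling):

```
import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()
# Pin each Horovod process to a single GPU.
gpus = tf.config.list_physical_devices("GPU")
if gpus:
    tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], "GPU")

GLOBAL_BATCH_SIZE = 131072
local_batch_size = GLOBAL_BATCH_SIZE // hvd.size()   # strong scaling

optimizer = hvd.DistributedOptimizer(tf.keras.optimizers.RMSprop(0.00012))
callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]
# model.compile(optimizer=optimizer, loss="binary_crossentropy")
# model.fit(train_dataset, callbacks=callbacks, ...)
```

Such a script is then launched with one process per GPU, for example with the `mpiexec ... -np 8 python main.py` commands shown in the [Quick Start Guide](#quick-start-guide).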
**XLA** is a domain-specific compiler for linear algebra that can accelerate TensorFlow models with potentially no source code changes. Enabling XLA results in improvements to speed and memory usage: most internal benchmarks run ~1.1-1.5x faster after XLA is enabled. For more information on XLA visit [official XLA page](https://www.tensorflow.org/xla).
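One common way to turn on XLA auto-clustering for a TensorFlow 2 program (an assumption about the mechanism behind the `--xla` flag, not a copy of the training script):

```
import tensorflow as tf

# Enable XLA JIT compilation (auto-clustering) globally; a similar effect can
# be obtained with the TF_XLA_FLAGS=--tf_xla_auto_jit=2 environment variable.
tf.config.optimizer.set_jit(True)
```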
### Mixed precision training
Mixed precision is the combined use of different numerical precisions in a computational method. [Mixed precision](https://arxiv.org/abs/1710.03740) training offers significant computational speedup by performing operations in half-precision format while storing minimal information in single-precision to retain as much information as possible in critical parts of the network. Since the introduction of [Tensor Cores](https://developer.nvidia.com/tensor-cores) in Volta, and following with both the Turing and Ampere architectures, significant training speedups are experienced by switching to mixed precision -- up to 3x overall speedup on the most arithmetically intense model architectures. Using [mixed precision training](https://docs.nvidia.com/deeplearning/performance/mixed-precision-training/index.html) previously required two steps:
1. Porting the model to use the FP16 data type where appropriate.
2. Adding loss scaling to preserve small gradient values.
The ability to train deep learning networks with lower precision was introduced in the Pascal architecture and first supported in [CUDA 8](https://devblogs.nvidia.com/parallelforall/tag/fp16/) in the NVIDIA Deep Learning SDK.
For more information:
* How to train using mixed precision, see the [Mixed Precision Training](https://arxiv.org/abs/1710.03740) paper and [Training With Mixed Precision](https://docs.nvidia.com/deeplearning/performance/mixed-precision-training/index.html) documentation.
* Techniques used for mixed precision training, see the [Mixed-Precision Training of Deep Neural Networks](https://devblogs.nvidia.com/mixed-precision-training-deep-neural-networks/) blog.
* How to access and enable AMP for TensorFlow, see [Using TF-AMP](https://docs.nvidia.com/deeplearning/dgx/tensorflow-user-guide/index.html#tfamp) from the TensorFlow User Guide.
For information on the influence of mixed precision training on model accuracy in training and inference, go to [Training accuracy results](#training-accuracy-results).
#### Enabling mixed precision
To enable Wide & Deep training to use mixed precision, add the additional flag `--amp` to the training script. Refer to the [Quick Start Guide](#quick-start-guide) for more information.
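Under the hood, mixed precision in Keras is typically enabled with a global `mixed_float16` policy plus dynamic loss scaling. The sketch below uses the experimental API names from the TensorFlow release shipped in the 20.12 container and is an illustration only, not a copy of the training script:

```
import tensorflow as tf

# Compute in float16, keep variables in float32 (newer TensorFlow releases
# expose the same switch as tf.keras.mixed_precision.set_global_policy).
tf.keras.mixed_precision.experimental.set_policy("mixed_float16")

optimizer = tf.keras.optimizers.RMSprop(0.00012)
# Dynamic loss scaling preserves small gradient values in float16.
optimizer = tf.keras.mixed_precision.experimental.LossScaleOptimizer(
    optimizer, loss_scale="dynamic"
)
```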
#### Enabling TF32
TensorFloat-32 (TF32) is the new math mode in NVIDIA A100 GPUs for handling tensor operations. TF32 running on Tensor Cores in A100 GPUs can provide up to 10x speedups compared to single-precision floating-point math (FP32) on Volta GPUs.
TF32 Tensor Cores can speed up networks using FP32, typically with no loss of accuracy. It is more robust than FP16 for models which require high dynamic range for weights or activations.
For more information, refer to the [TensorFloat-32 in the A100 GPU Accelerates AI Training, HPC up to 20x](https://blogs.nvidia.com/blog/2020/05/14/tensorfloat-32-precision-format/) blog post.
TF32 is supported in the NVIDIA Ampere GPU architecture and is enabled by default.
### Glossary
**Request level features**
Features that describe the person and context to which we wish to make recommendations.
**Item level features**
Features that describe those objects which we are considering recommending.
## Setup
The following section lists the requirements that you need to meet in order to start training the Wide & Deep model.
### Requirements
This repository contains Dockerfile which extends the TensorFlow2 NGC container and encapsulates some dependencies. Aside from these dependencies, ensure you have the following components:
- [NVIDIA Docker](https://github.com/NVIDIA/nvidia-docker)
- [20.12-tf2-py3](https://ngc.nvidia.com/catalog/containers/nvidia:tensorflow) NGC container
Supported GPUs:
- [NVIDIA Volta architecture](https://www.nvidia.com/en-us/data-center/volta-gpu-architecture/)
- [NVIDIA Turing architecture](https://www.nvidia.com/en-us/design-visualization/technologies/turing-architecture/)
- [NVIDIA Ampere architecture](https://www.nvidia.com/en-us/data-center/nvidia-ampere-gpu-architecture/)
For more information about how to get started with NGC containers, see the following sections from the NVIDIA GPU Cloud Documentation and the Deep Learning Documentation:
* [Getting Started Using NVIDIA GPU Cloud](https://docs.nvidia.com/ngc/ngc-getting-started-guide/index.html)
* [Accessing And Pulling From The NGC Container Registry](https://docs.nvidia.com/deeplearning/frameworks/user-guide/index.html#accessing_registry)
* [Running TensorFlow](https://docs.nvidia.com/deeplearning/frameworks/tensorflow-release-notes/running.html#running)
For those unable to use the TensorFlow2 NGC container, to set up the required environment or create their own container, see the versioned [NVIDIA Container Support Matrix](https://docs.nvidia.com/deeplearning/frameworks/support-matrix/index.html).
## Quick Start Guide
To train your model using the default parameters of the Wide & Deep model on the Outbrain dataset in TF32 or FP32 precision, perform the following steps. For the specifics concerning training and inference with custom settings, see the [Advanced section](#advanced).
1. Clone the repository.
```
git clone https://github.com/NVIDIA/DeepLearningExamples
```
2. Go to `WideAndDeep` TensorFlow2 directory within the `DeepLearningExamples` repository:
```
cd DeepLearningExamples/TensorFlow2/Recommendation/WideAndDeep
```
3. Download the Outbrain dataset.
The Outbrain dataset can be downloaded from Kaggle (requires Kaggle account). Unzip the downloaded archive (for example, to `/raid/outbrain/orig`) and set the `HOST_OUTBRAIN_PATH` variable to the parent directory:
```
HOST_OUTBRAIN_PATH=/raid/outbrain
```
4. Preprocess the Outbrain dataset.
4.1. Build the Wide & Deep Preprocessing Container.
```
cd DeepLearningExamples/TensorFlow2/Recommendation/WideAndDeep
docker build -f Dockerfile-preproc . -t wd2-prep
```
4.2. Start an interactive session in the Wide&Deep Preprocessing Container. Run preprocessing against the original Outbrain dataset to produce TFRecords. You can run preprocessing using either Spark (CPU) or NVTabular (GPU).
```
nvidia-docker run --rm -it --ipc=host -v ${HOST_OUTBRAIN_PATH}:/outbrain wd2-prep bash
```
4.3. Start preprocessing.
You can preprocess the data using either Spark on CPU or NVTabular on GPU. For more information, go to the [Dataset preprocessing](#dataset-preprocessing) section.
4.3.1. CPU Preprocessing (Spark).
```
cd /wd && bash scripts/preproc.sh spark 40
```
4.3.2. GPU Preprocessing (NVTabular).
```
cd /wd && bash scripts/preproc.sh nvtabular 40
```
The result of the preprocessing scripts is pre-batched TFRecords. The argument to the script is the number of TFRecord files that will be generated (here, 40). The TFRecord files are generated in `${HOST_OUTBRAIN_PATH}/tfrecords`.
4.4. Training of the model
4.4.1. Build the Wide&Deep Training Container
```
cd DeepLearningExamples/TensorFlow2/Recommendation/WideAndDeep
docker build -f Dockerfile-train . -t wd2-train
```
4.4.2. Start an interactive session in the Wide&Deep Training Container
```
nvidia-docker run --rm -it --privileged --ipc=host -v ${HOST_OUTBRAIN_PATH}:/outbrain wd2-train bash
```
4.4.3. Run training
For 1 GPU:
```
python main.py
```
For 1 GPU with Mixed Precision training with XLA:
```
python main.py --xla --amp
```
For complete usage, run:
```
python main.py -h
```
For 8 GPUs:
```
mpiexec --allow-run-as-root --bind-to socket -np 8 python main.py
```
For 8 GPU with Mixed Precision training with XLA:
```
mpiexec --allow-run-as-root --bind-to socket -np 8 python main.py --xla --amp
```
5. Run validation or evaluation.
If you want to run validation or evaluation, you can either:
* use the checkpoint obtained from the training commands above, or
* download the pretrained checkpoint from NGC.
In order to download the checkpoint from NGC, visit [ngc.nvidia.com](https://ngc.nvidia.com) website and browse the available models. Download the checkpoint files and unzip them to some path, for example, to `$HOST_OUTBRAIN_PATH/checkpoints/` (which is the default path for storing the checkpoints during training). The checkpoint requires around 700MB disk space.
6. Start validation/evaluation.
In order to validate the checkpoint on the evaluation set, run the `main.py` script with the `--evaluate` and `--use_checkpoint` flags.
For 1 GPU:
```
python main.py --evaluate --use_checkpoint
```
For 8 GPUs:
```
mpiexec --allow-run-as-root --bind-to socket -np 8 python main.py --evaluate --use_checkpoint
```
Now that you have your model trained and evaluated, you can choose to compare your training results with our [Training accuracy results](#training-accuracy-results). You can also benchmark your performance against the [Training and evaluation performance benchmark](#training-and-evaluation-performance-benchmark). Following the steps in these sections will ensure that you achieve the same accuracy and performance results as stated in the [Results](#results) section.
## Advanced
The following sections provide greater details of the dataset, running training, and the training results.
### Scripts and sample code
These are the important scripts in this repository:
* `main.py` - Python script for training the Wide & Deep recommender model. This script is run inside the training container (named `wd2-train` in the [Quick Start Guide](#quick-start-guide)).
* `scripts/preproc.sh` - Bash script that prepares the Outbrain dataset for training by preprocessing it and saving it in TFRecord format. This script is run inside the preprocessing container (named `wd2-prep` in the [Quick Start Guide](#quick-start-guide)).
* `data/outbrain/dataloader.py` - Python file containing the data loaders for the training and evaluation sets.
* `data/outbrain/features.py` - Python file describing the request- and item-level features as well as embedding dimensions and hash bucket sizes.
* `trainer/model/widedeep.py` - Python file with model definition.
* `trainer/utils/run.py` - Python file with training loop.
### Parameters
These are the important parameters in the `main.py` script:
| Scope | Parameter | Comment | Default Value |
| -------------------- | ------------------------------------------------------ | ---------------------------------------------------------------------- | --------------------- |
| location of datasets | --transformed_metadata_path TRANSFORMED_METADATA_PATH | Path to transformed_metadata for feature specification reconstruction | |
| location of datasets | --use_checkpoint | Use checkpoint stored in model_dir path | False |
| location of datasets | --model_dir MODEL_DIR | Destination where model checkpoint will be saved | /outbrain/checkpoints |
| location of datasets | --results_dir RESULTS_DIR | Directory to store training results | /results |
| location of datasets | --log_filename LOG_FILENAME | Name of the file to store dllogger output | log.json |
| training parameters | --training_set_size TRAINING_SET_SIZE | Number of samples in the training set | 59761827 |
| training parameters | --global_batch_size GLOBAL_BATCH_SIZE | Total size of training batch | 131072 |
| training parameters | --eval_batch_size EVAL_BATCH_SIZE | Total size of evaluation batch | 131072 |
| training parameters | --num_epochs NUM_EPOCHS | Number of training epochs | 20 |
| training parameters | --cpu | Run computations on the CPU | False |
| training parameters | --amp | Enable automatic mixed precision conversion | False |
| training parameters | --xla | Enable XLA conversion | False |
| training parameters | --linear_learning_rate LINEAR_LEARNING_RATE | Learning rate for linear model | 0.02 |
| training parameters | --deep_learning_rate DEEP_LEARNING_RATE | Learning rate for deep model | 0.00012 |
| training parameters | --deep_warmup_epochs DEEP_WARMUP_EPOCHS | Number of learning rate warmup epochs for deep model | 6 |
| model construction | --deep_hidden_units DEEP_HIDDEN_UNITS [DEEP_HIDDEN_UNITS ...] | Hidden units per layer for deep model, separated by spaces | [1024, 1024, 1024, 1024, 1024] |
| model construction | --deep_dropout DEEP_DROPOUT | Dropout regularization for deep model | 0.1 |
| run mode parameters | --evaluate | Only perform an evaluation on the validation dataset, don't train | False |
| run mode parameters | --benchmark | Run training or evaluation benchmark to collect performance metrics | False |
| run mode parameters | --benchmark_warmup_steps BENCHMARK_WARMUP_STEPS | Number of warmup steps before start of the benchmark | 500 |
| run mode parameters | --benchmark_steps BENCHMARK_STEPS | Number of steps for performance benchmark | 1000 |
| run mode parameters | --affinity {socket,single,single_unique,<br>socket_unique_interleaved,<br>socket_unique_continuous,disabled} | Type of CPU affinity | socket_unique_interleaved |
### Command-line options
To see the full list of available options and their descriptions, use the `-h` or `--help` command-line option:
```
python main.py -h
```
### Getting the data
The Outbrain dataset can be downloaded from [Kaggle](https://www.kaggle.com/c/outbrain-click-prediction/data) (requires Kaggle account).
#### Dataset guidelines
The dataset contains a sample of users' page views and clicks, as observed on multiple publisher sites. Viewed pages and clicked recommendations have additional semantic attributes of the documents. The dataset contains sets of content recommendations served to a specific user in a specific context. Each context (i.e., a set of recommended ads) is given a `display_id`. In each such recommendation set, the user has clicked on exactly one of the ads.
The original data is stored in several separate files:
* `page_views.csv` - log of users visiting documents (2B rows, ~100GB uncompressed)
* `clicks_train.csv` - data showing which ad was clicked in each recommendation set (87M rows)
* `clicks_test.csv` - used only for the submission in the original Kaggle contest
* `events.csv` - metadata about the context of each recommendation set (23M rows)
* `promoted_content.csv` - metadata about the ads
* `document_meta.csv`, `document_topics.csv`, `document_entities.csv`, `document_categories.csv` - metadata about the documents
During the preprocessing stage, the data is transformed into a tabular dataset of 87M rows and 26 features. The dataset is split into training and evaluation parts with approximately 60M and 27M rows, respectively. The split is made so that a random 80% of the daily events for the first 10 days of the dataset form the training set, and the remaining part (20% of the daily events for the first 10 days and all events in the last two days) forms the evaluation set. Finally, the dataset is saved in pre-batched TFRecord format.
#### Dataset preprocessing
Dataset preprocessing aims at creating a total of 26 features: 13 categorical and 13 numerical. These features are obtained from the original Outbrain dataset during preprocessing. There are 2 types of preprocessing available for the model:
* Spark CPU preprocessing
* [NVTabular](https://nvidia.github.io/NVTabular/v0.3.0/index.html) GPU preprocessing
Both split the dataset into train and evaluation sets and produce the same feature set; therefore, the training is agnostic to the preprocessing step.
For a comparison of Spark CPU and NVTabular GPU preprocessing, go to [NVTabular and Spark CPU Preprocessing comparison](#nvtabular-and-spark-cpu-preprocessing-comparison).
##### Spark CPU Dataset preprocessing
The original dataset is preprocessed using the scripts provided in `data/outbrain/spark`. Preprocessing is split into 3 preprocessing steps: `preproc1.py`, `preproc2.py`, and `preproc3.py` that form a complete workflow. The workflow consists of the following operations:
* separating out the validation set for cross-validation
* filling missing data with mode, median, or imputed values
* joining click data, ad metadata, and document category, topic and entity tables to create an enriched table
* computing 7 click-through rates (CTR) for ads grouped by 7 features
* computing attribute cosine similarity between the landing page and ad to be featured on the page
* math transformations of the numeric features (logarithmic, scaling, binning)
* categorifying data using hash-bucketing
* storing the resulting set of features in pre-batched TFRecord format
The `preproc1-3.py` preprocessing scripts use PySpark. In the Docker image, Spark 2.3.1 is installed as a standalone Spark cluster. The `preproc1.py` script splits the data into a training set and a validation set. The `preproc2.py` script computes the click-through rates (CTR) and cosine similarities between the features. The `preproc3.py` script performs the math transformations and generates the final TFRecord files. The data in the output files is pre-batched (with the default batch size of 4096) to avoid the overhead of the TFRecord format, which otherwise is not well suited to tabular data (a sketch of reading this pre-batched format is shown at the end of the [Dataset preprocessing](#dataset-preprocessing) section).
The preprocessing includes some very resource-intensive operations, including joins of tables having over 2 billion rows. Such operations may not fit into RAM, and therefore we use Spark, which is well suited to handling tabular operations on large data with limited RAM. Note that the Spark job requires about 500 GB of disk space and 300 GB of RAM to perform the preprocessing.
For more information about Spark, refer to the [Spark documentation](https://spark.apache.org/docs/2.3.1/).
##### NVTabular GPU preprocessing
With NVTabular, the dataset is preprocessed using the script provided in `data/outbrain/nvtabular`. The workflow consists of most of the same operations as the Spark pipeline:
* separating out the validation set for cross-validation
* filling missing data with the most frequent value
* joining the tables for the ad clicks data
* computing 7 click-through rates (CTR) for ads grouped by 7 different contexts
* computing attribute cosine similarity between the features of the clicked ads and the viewed ads
* math transformations of the numeric features (logarithmic, normalization)
* categorifying data using hash-bucketing
* storing the result in a Parquet format
* transforming the result into the pre-batched TFRecord format
Most of the code describing the operations in this workflow is in `data/outbrain/nvtabular/utils/workflow.py` and leverages NVTabular v0.3. As stated in its repository, [NVTabular](https://github.com/NVIDIA/NVTabular), a component of [NVIDIA Merlin Open Beta](https://developer.nvidia.com/nvidia-merlin), is a feature engineering and preprocessing library for tabular data that is designed to quickly and easily manipulate terabyte-scale datasets and train deep learning based recommender systems. It provides a high-level abstraction to simplify code and accelerates computation on the GPU using the [RAPIDS Dask-cuDF](https://github.com/rapidsai/cudf/tree/main/python/dask_cudf) library. The code to transform the NVTabular Parquet output into TFRecords is in `data/outbrain/nvtabular/utils/converter.py`.
The NVTabular version of preprocessing is not subject to the same memory and storage constraints as its Spark counterpart, since NVTabular is able to manipulate tables on the GPU and work with tables much larger than even physical RAM. The NVTabular Outbrain workflow has been successfully tested on DGX-1 V100 and DGX A100 for single- and multi-GPU preprocessing.
For more information about NVTabular, refer to the [NVTabular documentation](https://github.com/NVIDIA/NVTabular).
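Both preprocessing variants emit the same pre-batched TFRecord files (with the default pre-batch size of 4096 mentioned above). A minimal sketch of parsing such records with `tf.data`, using a hypothetical numeric feature name and file pattern rather than the repository's actual feature specification:

```
import tensorflow as tf

PREBATCH_SIZE = 4096  # default pre-batch size used during preprocessing

# Hypothetical feature spec: every tf.Example already holds a whole batch,
# so each feature is a fixed-length vector of PREBATCH_SIZE values.
feature_spec = {
    "feat_0": tf.io.FixedLenFeature([PREBATCH_SIZE], tf.float32),
    "label": tf.io.FixedLenFeature([PREBATCH_SIZE], tf.int64),
}

def parse_prebatched(serialized):
    example = tf.io.parse_single_example(serialized, feature_spec)
    return {"feat_0": example["feat_0"]}, example["label"]

files = tf.io.gfile.glob("/outbrain/tfrecords/train/part*")  # hypothetical pattern
dataset = (
    tf.data.TFRecordDataset(files)
    .map(parse_prebatched, num_parallel_calls=tf.data.experimental.AUTOTUNE)
    .prefetch(1)
)
```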
### Training process
The training can be started by running the `main.py` script. By default, the script is in train mode. Other training-related configs are also present in `trainer/utils/arguments.py` and can be seen using the command `python main.py -h`. Training uses the TFRecord training dataset files that match `--train_data_pattern`. Training runs for `--num_epochs` epochs with a global batch size of `--global_batch_size` in strong scaling mode (i.e., the effective batch size per GPU equals `global_batch_size/gpu_count`).
The model:
`tf.keras.experimental.WideDeepModel` consists of a wide part and a deep part with a sigmoid activation in the output layer (see [Figure 1](#model-architecture) for reference and `trainer/model/widedeep.py` for the model definition).
During training (default configuration), two separate optimizers are used to optimize the wide and the deep part of the network (a minimal sketch is shown at the end of this section):
* FTRL (Follow The Regularized Leader) optimizer is used to optimize the wide part of the network.
* RMSProp optimizer is used to optimize the deep part of the network.
Checkpoints of the model:
* can be loaded at the beginning of training when `--use_checkpoint` is set
* are saved into `--model_dir` after each training epoch (only the last checkpoint is kept)
* contain information about the number of completed training epochs
The model is evaluated on the evaluation dataset after every training epoch. The training log is displayed in the console and stored in `--log_filename`:
* every 100 batches, the training metrics are logged: loss, binary accuracy, AUC ROC, and MAP@12
* after every training epoch, the evaluation metrics are logged: loss, binary accuracy, AUC ROC, and MAP@12
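A minimal sketch of the model and its two optimizers (sizes and learning rates taken from the defaults above; input handling and embeddings omitted; see `trainer/model/widedeep.py` for the actual definition):

```
import tensorflow as tf

# Wide (linear) part and deep part: 5 hidden layers of 1024 units with dropout 0.1.
wide = tf.keras.experimental.LinearModel()
deep_layers = []
for _ in range(5):
    deep_layers += [tf.keras.layers.Dense(1024, activation="relu"),
                    tf.keras.layers.Dropout(0.1)]
deep = tf.keras.Sequential(deep_layers + [tf.keras.layers.Dense(1)])

# WideDeepModel sums both outputs and applies the sigmoid activation.
model = tf.keras.experimental.WideDeepModel(wide, deep, activation="sigmoid")

model.compile(
    # The first optimizer trains the wide part, the second the deep part.
    optimizer=[tf.keras.optimizers.Ftrl(0.02), tf.keras.optimizers.RMSprop(0.00012)],
    loss="binary_crossentropy",
    metrics=[tf.keras.metrics.AUC(), tf.keras.metrics.BinaryAccuracy()],
)
```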
### Evaluation process
The evaluation can be started by running the `main.py --evaluate` script. Evaluation is done on the TFRecord dataset specified by `--eval_data_pattern`. Other evaluation-related configs are also present in `trainer/utils/arguments.py` and can be seen using the command `python main.py -h`.
During evaluation (with the `--evaluate` flag):
* the model is restored from the checkpoint in `--model_dir` if `--use_checkpoint` is set
* the evaluation log is displayed in the console and stored in `--log_filename`
* every 100 batches, the evaluation metrics are logged: loss, binary accuracy, AUC ROC, and MAP@12
After the whole evaluation, the total evaluation metrics are logged: loss, binary accuracy, AUC ROC, and MAP@12.
## Performance
### Benchmarking
The following section shows how to run benchmarks measuring the model performance in training and evaluation modes.
#### NVTabular and Spark CPU Preprocessing comparison
Two types of dataset preprocessing are provided: Spark on CPU and NVTabular on GPU. Both of these pipelines return pre-batched TFRecord files with the same structure. The following table compares the two in terms of code complexity (lines of code), top RAM consumption, and preprocessing time.
| | Spark on NVIDIA DGX-1 (CPU) | Spark on NVIDIA DGX A100 (CPU) | NVTabular on DGX-1 1 GPU | NVTabular on DGX-1 8 GPU | NVTabular on DGX A100 1 GPU | NVTabular on DGX A100 8 GPU |
| -------------------------- | ----- | ----- | ----- | ----- | ----- | ----- |
| Lines of code* | ~1500 | ~1500 | ~500 | ~500 | ~500 | ~500 |
| Top RAM consumption \[GB\] | 167.0 | 223.4 | 34.3 | 48.7 | 37.7 | 50.6 |
| Top VRAM consumption per GPU \[GB\] | 0 | 0 | 16 | 13 | 45 | 67 |
| Preprocessing time \[min\] | 45.6 | 38.5 | 4.4 | 3.9 | 2.6 | 2.3 |
To achieve the same results for Top RAM consumption and preprocessing time, run a preprocessing container (`${HOST_OUTBRAIN_PATH}` is the path with Outbrain dataset).
```
nvidia-docker run --rm -it --ipc=host -v ${HOST_OUTBRAIN_PATH}:/outbrain wd2-prep bash
```
In the preprocessing container, run the preprocessing benchmark.
For Spark CPU preprocessing:
```
cd /wd && bash scripts/preproc_benchmark.sh -m spark
```
For GPU NVTabular preprocessing:
```
cd /wd && bash scripts/preproc_benchmark.sh -m nvtabular
```
#### Training and evaluation performance benchmark
The benchmark script measures the performance of the model during training (default configuration) and evaluation (`--evaluate`). The benchmark runs training or evaluation for `--benchmark_steps` batches; however, the performance measurement starts only after `--benchmark_warmup_steps`. The benchmark can be run on 1 or 8 GPUs and with any combination of XLA (`--xla`), AMP (`--amp`), batch sizes (`--global_batch_size`, `--eval_batch_size`), and affinity (`--affinity`). A conceptual sketch of the measurement is shown after the commands below.
In order to run the benchmark, follow these steps:
Run the training container (`${HOST_OUTBRAIN_PATH}` is the path to the Outbrain dataset):
```
nvidia-docker run --rm -it --ipc=host --privileged -v ${HOST_OUTBRAIN_PATH}:/outbrain wd2-train bash
```
Run the benchmark script:
For 1 GPU:
```
python main.py --benchmark
```
The benchmark will be run for training with default training parameters.
For 8 GPUs:
```
mpiexec --allow-run-as-root --bind-to socket -np 8 python main.py --benchmark
```
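Conceptually, the benchmark mode performs the following measurement (a sketch under assumptions, not the actual implementation): it runs `--benchmark_warmup_steps` unmeasured steps, then times `--benchmark_steps` steps and reports samples per second:

```
import time

def measure_throughput(step_fn, batch_size, warmup_steps=500, benchmark_steps=1000):
    # Warmup steps are executed but not timed.
    for _ in range(warmup_steps):
        step_fn()
    start = time.perf_counter()
    for _ in range(benchmark_steps):
        step_fn()
    elapsed = time.perf_counter() - start
    return benchmark_steps * batch_size / elapsed   # samples per second
```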
### Results
The following sections provide details on how we achieved our performance and accuracy in training.
#### Training accuracy results
##### Training accuracy: NVIDIA DGX A100 (8x A100 80GB)
Our results were obtained by running the `main.py` training script in the TensorFlow2 NGC container on NVIDIA DGX A100 with (8x A100 80GB) GPUs.
| GPUs | Batch size / GPU | XLA | Accuracy - TF32 (MAP@12), Spark dataset | Accuracy - mixed precision (MAP@12), Spark dataset | Accuracy - TF32 (MAP@12), NVTabular dataset | Accuracy - mixed precision (MAP@12), NVTabular dataset | Time to train - TF32 (minutes) | Time to train - mixed precision (minutes) | Time to train speedup (TF32 to mixed precision) |
| ---- | ---------------- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 131072 | Yes | 0.65536 | 0.65537 | 0.65537 | 0.65646 | 16.40 | 13.71 | 1.20 |
| 1 | 131072 | No | 0.65538 | 0.65533 | 0.65533 | 0.65643 | 19.58 | 18.49 | 1.06 |
| 8 | 16384 | Yes | 0.65527 | 0.65525 | 0.65525 | 0.65638 | 7.77 | 9.71 | 0.80 |
| 8 | 16384 | No | 0.65517 | 0.65525 | 0.65525 | 0.65638 | 7.84 | 9.48 | 0.83 |
To achieve the same results, follow the steps in the [Quick Start Guide](#quick-start-guide).
##### Training accuracy: NVIDIA DGX-1 (8x V100 16GB)
Our results were obtained by running the `main.py` training script in the TensorFlow2 NGC container on NVIDIA DGX-1 with (8x V100 16GB) GPUs.
| GPUs | Batch size / GPU | XLA | Accuracy - FP32 (MAP@12), Spark dataset | Accuracy - mixed precision (MAP@12), Spark dataset | Accuracy - FP32 (MAP@12), NVTabular dataset | Accuracy - mixed precision (MAP@12), NVTabular dataset | Time to train - FP32 (minutes) | Time to train - mixed precision (minutes) | Time to train speedup (FP32 to mixed precision) |
| ---- | ---------------- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 131072 | Yes | 0.65531 | 0.65529 | 0.65529 | 0.65651 | 66.01 | 23.66 | 2.79 |
| 1 | 131072 | No | 0.65542 | 0.65534 | 0.65534 | 0.65641 | 72.68 | 29.18 | 2.49 |
| 8 | 16384 | Yes | 0.65544 | 0.65547 | 0.65547 | 0.65642 | 16.28 | 13.90 | 1.17 |
| 8 | 16384 | No | 0.65548 | 0.65540 | 0.65540 | 0.65633 | 16.34 | 12.65 | 1.29 |
To achieve the same results, follow the steps in the [Quick Start Guide](#quick-start-guide).
##### Training accuracy plots
Models trained with FP32, TF32, and Automatic Mixed Precision (AMP), with and without XLA enabled, achieve similar accuracy.
The plots show MAP@12 as a function of training steps (a step is a single batch) for the default precision (FP32 for the Volta architecture (DGX-1) and TF32 for the Ampere GPU architecture (DGX A100)) and for AMP, with and without XLA, for both datasets. All other training parameters are at their defaults.
<p align="center">
<img width="100%" src="./img/leraning_curve_spark.svg" />
<br>
Figure 2. Learning curves for Spark dataset for different configurations.</a>
</p>
<p align="center">
<img width="100%" src="./img/learning_curve_nvt.svg" />
<br>
Figure 3. Learning curves for NVTabular dataset for different configurations.</a>
</p>
##### Training stability test
Training of the model is stable for multiple configurations, achieving a standard deviation of MAP@12 on the order of 10e-4. The model achieves similar MAP@12 scores across A100 and V100, training precisions, XLA usage, and single/multi-GPU setups. The Wide & Deep model was trained for 9100 training steps (20 epochs, 455 batches per epoch, each batch containing 131072 samples), starting from 20 different initial random seeds for each setup. The training was performed in the 20.12-tf2-py3 NGC container on NVIDIA DGX A100 80GB and DGX-1 16GB machines, with and without mixed precision enabled and with and without XLA enabled, for the Spark- and NVTabular-generated datasets. The provided charts and numbers consider single- and 8-GPU training. After training, the models were evaluated on the validation set. The following plots compare distributions of MAP@12 on the evaluation set; columns correspond to single vs. 8-GPU training, rows to DGX A100 and DGX-1 V100.
<p align="center">
<img width="100%" src="./img/training_stability_spark.svg" />
<br>
Figure 4. Training stability for Spark dataset: distribution of MAP@12 across different configurations. 'All configurations' refer to the distribution of MAP@12 for cartesian product of architecture, training precision, XLA usage, single/multi GPU. </a>
</p>
<p align="center">
<img width="100%" src="./img/training_stability_nvtabular.svg" />
<br>
Figure 5. Training stability for NVtabular dataset: distribution of MAP@12 across different configurations. 'All configurations' refer to the distribution of MAP@12 for cartesian product of architecture, training precision, XLA usage, single/multi GPU.</a>
</p>
Training stability was also compared in terms of point statistics for MAP@12 distribution for multiple configurations. Refer to the expandable table below.
<details>
<summary>Full tabular data for training stability tests</summary>
| Platform | GPUs | Precision | Dataset | XLA | Mean | Std | Min | Max |
|--------|-|---------|-----------|---|----|---|---|---
DGX A100|1|TF32|Spark preprocessed|Yes|0.65536|0.00016|0.65510|0.65560|
DGX A100|1|TF32|Spark preprocessed|No|0.65538|0.00013|0.65510|0.65570|
DGX A100|1|TF32|NVTabular preprocessed|Yes|0.65641|0.00038|0.65530|0.65680|
DGX A100|1|TF32|NVTabular preprocessed|No|0.65648|0.00024|0.65580|0.65690|
DGX A100|1|AMP|Spark preprocessed|Yes|0.65537|0.00013|0.65510|0.65550|
DGX A100|1|AMP|Spark preprocessed|No|0.65533|0.00016|0.65500|0.65550|
DGX A100|1|AMP|NVTabular preprocessed|Yes|0.65646|0.00036|0.65530|0.65690|
DGX A100|1|AMP|NVTabular preprocessed|No|0.65643|0.00027|0.65590|0.65690|
DGX A100|8|TF32|Spark preprocessed|Yes|0.65527|0.00013|0.65500|0.65560|
DGX A100|8|TF32|Spark preprocessed|No|0.65517|0.00025|0.65460|0.65560|
DGX A100|8|TF32|NVTabular preprocessed|Yes|0.65631|0.00038|0.65550|0.65690|
DGX A100|8|TF32|NVTabular preprocessed|No|0.65642|0.00022|0.65570|0.65680|
DGX A100|8|AMP|Spark preprocessed|Yes|0.65525|0.00018|0.65490|0.65550|
DGX A100|8|AMP|Spark preprocessed|No|0.65525|0.00016|0.65490|0.65550|
DGX A100|8|AMP|NVTabular preprocessed|Yes|0.65638|0.00026|0.65580|0.65680|
DGX A100|8|AMP|NVTabular preprocessed|No|0.65638|0.00031|0.65560|0.65700|
DGX-1 V100|1|FP32|Spark preprocessed|Yes|0.65531|0.00017|0.65490|0.65560|
DGX-1 V100|1|FP32|Spark preprocessed|No|0.65542|0.00012|0.65520|0.65560|
DGX-1 V100|1|FP32|NVTabular preprocessed|Yes|0.65651|0.00019|0.65610|0.65680|
DGX-1 V100|1|FP32|NVTabular preprocessed|No|0.65638|0.00035|0.65560|0.65680|
DGX-1 V100|1|AMP|Spark preprocessed|Yes|0.65529|0.00015|0.65500|0.65570|
DGX-1 V100|1|AMP|Spark preprocessed|No|0.65534|0.00015|0.65500|0.65560|
DGX-1 V100|1|AMP|NVTabular preprocessed|Yes|0.65651|0.00028|0.65560|0.65690|
DGX-1 V100|1|AMP|NVTabular preprocessed|No|0.65641|0.00032|0.65570|0.65680|
DGX-1 V100|8|FP32|Spark preprocessed|Yes|0.65544|0.00019|0.65500|0.65580|
DGX-1 V100|8|FP32|Spark preprocessed|No|0.65548|0.00013|0.65510|0.65560|
DGX-1 V100|8|FP32|NVTabular preprocessed|Yes|0.65645|0.00012|0.65630|0.65670|
DGX-1 V100|8|FP32|NVTabular preprocessed|No|0.65638|0.00015|0.65610|0.65670|
DGX-1 V100|8|AMP|Spark preprocessed|Yes|0.65547|0.00015|0.65520|0.65580|
DGX-1 V100|8|AMP|Spark preprocessed|No|0.65540|0.00019|0.65500|0.65580|
DGX-1 V100|8|AMP|NVTabular preprocessed|Yes|0.65642|0.00028|0.65580|0.65690|
DGX-1 V100|8|AMP|NVTabular preprocessed|No|0.65633|0.00037|0.65510|0.65680|
</details>
##### Impact of mixed precision on training accuracy
The accuracy of training, measured with [MAP@12](https://en.wikipedia.org/wiki/Evaluation_measures_(information_retrieval)#Mean_average_precision) on the evaluation set after the final epoch, was not impacted by enabling mixed precision. The obtained results were statistically similar. The similarity was measured according to the following procedure:
The model was trained 20 times with default settings (FP32 or TF32 for the Volta and Ampere architectures, respectively) and 20 times with AMP. After the last epoch, the MAP@12 accuracy score was calculated on the evaluation set.
Distributions for four configurations (architecture: A100, V100; single/multi-GPU) for the 2 datasets are presented below.
<p align="center">
<img width="100%" src="./img/amp_influence_spark.svg" />
<br>
Figure 6. Influence of AMP on MAP@12 distribution for DGX A100 and DGX-1 V100 for single and multi gpu training on Spark dataset. </a>
</p>
<p align="center">
<img width="100%" src="./img/amp_influence_nvtabular.svg" />
<br>
Figure 7. Influence of AMP on MAP@12 distribution for DGX A100 and DGX-1 V100 for single and multi gpu training on NVTabular dataset.
</p>
MAP@12 distributions for full precision and AMP training were compared in terms of mean, variance, and the [Kolmogorov-Smirnov test](https://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test) to assess the statistical difference between full precision and AMP results (a sketch of the test is shown below); refer to the expandable table for the full data.
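A sketch of this comparison for one configuration (placeholder MAP@12 values; in practice the 20 scores collected per precision are used), using the two-sample Kolmogorov-Smirnov test from SciPy:

```
import numpy as np
from scipy.stats import ks_2samp

# Placeholder samples; in practice these are the 20 MAP@12 scores per precision.
full_precision = np.array([0.65538, 0.65541, 0.65529, 0.65536])
amp = np.array([0.65533, 0.65536, 0.65527, 0.65534])

statistic, p_value = ks_2samp(full_precision, amp)
print(f"mean difference: {full_precision.mean() - amp.mean():+.5f}")
print(f"KS statistic: {statistic:.3f}, p-value: {p_value:.3f}")
# A large p-value means the hypothesis that both samples come from the same
# distribution cannot be rejected.
```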
<details>
<summary>Full tabular data for AMP influence on MAP@12</summary>
| Platform | GPUs | Dataset | XLA | Mean MAP@12 for full precision (TF32 for A100, FP32 for V100) | Std MAP@12 for full precision (TF32 for A100, FP32 for V100) | Mean MAP@12 for AMP | Std MAP@12 for AMP | KS test value: statistic (p-value) |
| ------------ | ---- | ---------------------- | --- | ------ | ------ | ------ | ------ | ---------------- |
| DGX A100 | 1 | NVTabular preprocessed | No | 0.6565 | 0.0002 | 0.6564 | 0.0003 | 0.2000 (0.8320) |
| DGX A100 | 8 | NVTabular preprocessed | No | 0.6564 | 0.0002 | 0.6564 | 0.0003 | 0.1500 (0.9831) |
| DGX A100 | 1 | Spark preprocessed | No | 0.6554 | 0.0001 | 0.6553 | 0.0002 | 0.2500 (0.5713) |
| DGX A100 | 8 | Spark preprocessed | No | 0.6552 | 0.0002 | 0.6552 | 0.0002 | 0.3000 (0.3356) |
| DGX A100 | 1 | NVTabular preprocessed | No | 0.6564 | 0.0004 | 0.6565 | 0.0004 | 0.1500 (0.9831) |
| DGX A100 | 8 | NVTabular preprocessed | No | 0.6563 | 0.0004 | 0.6564 | 0.0003 | 0.2500 (0.5713) |
| DGX A100 | 1 | Spark preprocessed | No | 0.6554 | 0.0002 | 0.6554 | 0.0001 | 0.1500 (0.9831) |
| DGX A100 | 8 | Spark preprocessed | No | 0.6553 | 0.0001 | 0.6552 | 0.0002 | 0.1500 (0.9831) |
| DGX-1 V100 | 1 | NVTabular preprocessed | No | 0.6564 | 0.0004 | 0.6564 | 0.0003 | 0.1000 (1.0000) |
| DGX-1 V100 | 8 | NVTabular preprocessed | No | 0.6564 | 0.0001 | 0.6563 | 0.0004 | 0.2500 (0.5713) |
| DGX-1 V100 | 1 | Spark preprocessed | No | 0.6554 | 0.0001 | 0.6553 | 0.0001 | 0.2000 (0.8320) |
| DGX-1 V100 | 8 | Spark preprocessed | No | 0.6555 | 0.0001 | 0.6554 | 0.0002 | 0.3500 (0.1745) |
| DGX-1 V100 | 1 | NVTabular preprocessed | No | 0.6565 | 0.0002 | 0.6565 | 0.0003 | 0.1500 (0.9831) |
| DGX-1 V100 | 8 | NVTabular preprocessed | No | 0.6564 | 0.0001 | 0.6564 | 0.0003 | 0.2000 (0.8320) |
| DGX-1 V100 | 1 | Spark preprocessed | No | 0.6553 | 0.0002 | 0.6553 | 0.0002 | 0.2000 (0.8320) |
| DGX-1 V100 | 8 | Spark preprocessed | No | 0.6554 | 0.0002 | 0.6555 | 0.0002 | 0.1500 (0.9831) |
</details>
#### Training performance results
##### Training performance: NVIDIA DGX A100 (8x A100 80GB)
Our results were obtained by running the benchmark script (`main.py --benchmark`) in the TensorFlow2 NGC container on NVIDIA DGX A100 with (8x A100 80GB) GPUs.
|GPUs | Batch size / GPU | XLA | Throughput - TF32 (samples/s)|Throughput - mixed precision (samples/s)|Throughput speedup (TF32 - mixed precision)| Strong scaling - TF32|Strong scaling - mixed precision
| ---- | ---------------- | --- | ----------------------------- | ---------------------------------------- | ------------------------------------------- | --------------------- | -------------------------------- |
|1|131,072|Yes|1642892|1997414|1.22|1.00|1.00|
|1|131,072|No|1269638|1355523|1.07|1.00|1.00|
|8|16,384|Yes|3376438|2508278|0.74|2.06|1.26|
|8|16,384|No|3351118|2643009|0.79|2.64|1.07|
##### Training performance: NVIDIA DGX-1 (8x V100 16GB)
Our results were obtained by running the benchmark script (`main.py --benchmark`) in the TensorFlow2 NGC container on NVIDIA DGX-1 with (8x V100 16GB) GPUs.
|GPUs | Batch size / GPU | XLA | Throughput - FP32 (samples/s)|Throughput - mixed precision (samples/s)|Throughput speedup (FP32 - mixed precision)| Strong scaling - FP32|Strong scaling - mixed precision
| ---- | ---------------- | --- | ----------------------------- | ---------------------------------------- | ------------------------------------------- | --------------------- | -------------------------------- |
|1|131,072|Yes|361202|1091584|3.02|1.00|1.00
|1|131,072|No|321816|847229|2.63|1.00|1.00
|8|16,384|Yes|1512691|1731391|1.14|4.19|1.59
|8|16,384|No|1490044|1837962|1.23|4.63|2.17
#### Evaluation performance results
##### Evaluation performance: NVIDIA DGX A100 (8x A100 80GB)
Our results were obtained by running the benchmark script (`main.py --evaluate --benchmark`) in the TensorFlow2 NGC container on NVIDIA DGX A100 with 8x A100 80GB GPUs.
|GPUs|Batch size / GPU|XLA|Throughput \[samples/s\] TF32|Throughput \[samples/s\] AMP|Throughput speedup AMP to TF32
|----|----------------|---|------------------------------|-----------------------------|-------------------------------
|1|4096|NO|648058|614053|0.95|
|1|8192|NO|1063986|1063203|1.00|
|1|16384|NO|1506679|1573248|1.04|
|1|32768|NO|1983238|2088212|1.05|
|1|65536|NO|2280630|2523812|1.11|
|1|131072|NO|2568911|2915340|1.13|
|8|4096|NO|4516588|4374181|0.97|
|8|8192|NO|7715609|7718173|1.00|
|8|16384|NO|11296845|11624159|1.03|
|8|32768|NO|14957242|15904745|1.06|
|8|65536|NO|17671055|19332987|1.09|
|8|131072|NO|19779711|21761656|1.10|
For more results, refer to the expandable table below.
<details>
<summary>Full tabular data for evaluation performance results for DGX A100</summary>
|GPUs|Batch size / GPU|XLA|Throughput \[samples/s\] TF32|Throughput \[samples/s\] AMP|Throughput speedup AMP to TF32
|----|----------------|---|------------------------------|-----------------------------|-------------------------------
|1|4096|YES|621024|648441|1.04|
|1|4096|NO|648058|614053|0.95|
|1|8192|YES|1068943|1045790|0.98|
|1|8192|NO|1063986|1063203|1.00|
|1|16384|YES|1554101|1710186|1.10|
|1|16384|NO|1506679|1573248|1.04|
|1|32768|YES|2014216|2363490|1.17|
|1|32768|NO|1983238|2088212|1.05|
|1|65536|YES|2010050|2450872|1.22|
|1|65536|NO|2280630|2523812|1.11|
|1|131072|YES|2321543|2885393|1.24|
|1|131072|NO|2568911|2915340|1.13|
|8|4096|YES|4328154|4445315|1.03|
|8|4096|NO|4516588|4374181|0.97|
|8|8192|YES|7410554|7640191|1.03|
|8|8192|NO|7715609|7718173|1.00|
|8|16384|YES|11412928|12422567|1.09|
|8|16384|NO|11296845|11624159|1.03|
|8|32768|YES|11428369|12525670|1.10|
|8|32768|NO|14957242|15904745|1.06|
|8|65536|YES|13453756|15308455|1.14|
|8|65536|NO|17671055|19332987|1.09|
|8|131072|YES|17047482|20930042|1.23|
|8|131072|NO|19779711|21761656|1.10|
</details>
##### Evaluation performance: NVIDIA DGX-1 (8x V100 16GB)
Our results were obtained by running the benchmark script (`main.py --evaluate --benchmark`) in the TensorFlow2 NGC container on NVIDIA DGX-1 with (8x V100 16GB) GPUs.
|GPUs|Batch size / GPU|XLA|Throughput \[samples/s\] FP32|Throughput \[samples/s\] AMP|Throughput speedup AMP to FP32
|----|----------------|---|------------------------------|-----------------------------|-------------------------------
|1|4096|NO|375928|439395|1.17|
|1|8192|NO|526780|754517|1.43|
|1|16384|NO|673971|1133696|1.68|
|1|32768|NO|791637|1470221|1.86|
|1|65536|NO|842831|1753500|2.08|
|1|131072|NO|892941|1990898|2.23|
|8|4096|NO|2893390|3278473|1.13|
|8|8192|NO|3881996|5337866|1.38|
|8|16384|NO|5003135|8086178|1.62|
|8|32768|NO|6124648|11087247|1.81|
|8|65536|NO|6631887|13233484|2.00|
|8|131072|NO|7030438|15081861|2.15|
For more results, refer to the expandable table below.
<details>
<summary>Full tabular data for evaluation performance for DGX-1 V100 results</summary>
|GPUs|Batch size / GPU|XLA|Throughput \[samples/s\] FP32|Throughput \[samples/s\] AMP|Throughput speedup AMP to FP32
|----|----------------|---|------------------------------|-----------------------------|-------------------------------
|1|4096|YES|356963|459481|1.29|
|1|4096|NO|375928|439395|1.17|
|1|8192|YES|517016|734515|1.42|
|1|8192|NO|526780|754517|1.43|
|1|16384|YES|660772|1150292|1.74|
|1|16384|NO|673971|1133696|1.68|
|1|32768|YES|776357|1541699|1.99|
|1|32768|NO|791637|1470221|1.86|
|1|65536|YES|863311|1962275|2.27|
|1|65536|NO|842831|1753500|2.08|
|1|131072|YES|928290|2235968|2.41|
|1|131072|NO|892941|1990898|2.23|
|8|4096|YES|2680961|3182591|1.19|
|8|4096|NO|2893390|3278473|1.13|
|8|8192|YES|3738172|5185972|1.39|
|8|8192|NO|3881996|5337866|1.38|
|8|16384|YES|4961435|8170489|1.65|
|8|16384|NO|5003135|8086178|1.62|
|8|32768|YES|6218767|11658218|1.87|
|8|32768|NO|6124648|11087247|1.81|
|8|65536|YES|6808677|14921211|2.19|
|8|65536|NO|6631887|13233484|2.00|
|8|131072|YES|7205370|16923294|2.35|
|8|131072|NO|7030438|15081861|2.15|
</details>
## Release notes
### Changelog
February 2021
Initial release
### Known issues
* In this model, TF32 precision can in some cases be as fast as FP16 precision on Ampere GPUs. This is because TF32 also uses Tensor Cores and does not need any additional logic such as maintaining FP32 master weights and casts. However, note that Wide & Deep is, by modern recommender standards, a very small model. Larger models should still see significant benefits from FP16 math.

View file

@@ -0,0 +1,139 @@
# Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from functools import partial
from multiprocessing import cpu_count
import tensorflow as tf
from data.outbrain.features import get_features_keys
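# The TFRecords produced by both preprocessing pipelines are prebatched: each
# tf.train.Example already holds PREBATCH_SIZE rows per feature (see features.py).
# After batching several such records, every tensor has an extra prebatch dimension;
# the helper below flattens the first two dimensions so the model receives plain
# per-sample batches, and drops any keys that are not model features.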
def _consolidate_batch(elem):
label = elem.pop('label')
reshaped_label = tf.reshape(label, [-1, label.shape[-1]])
features = get_features_keys()
reshaped_elem = {
key: tf.reshape(elem[key], [-1, elem[key].shape[-1]])
for key in elem
if key in features
}
return reshaped_elem, reshaped_label
def get_parse_function(feature_spec):
def _parse_function(example_proto):
return tf.io.parse_single_example(example_proto, feature_spec)
return _parse_function
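# Training input pipeline: list the TFRecord files matching the pattern, interleave
# reads across CPU workers, parse the prebatched examples, shard them across GPUs,
# shuffle, repeat indefinitely, batch, flatten the prebatch dimension with
# _consolidate_batch, and prefetch to overlap input processing with training.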
def train_input_fn(
filepath_pattern,
feature_spec,
records_batch_size,
num_gpus=1,
id=0):
_parse_function = get_parse_function(feature_spec)
dataset = tf.data.Dataset.list_files(
file_pattern=filepath_pattern
)
dataset = dataset.interleave(
lambda x: tf.data.TFRecordDataset(x),
cycle_length=cpu_count() // num_gpus,
block_length=1
)
dataset = dataset.map(
map_func=_parse_function,
num_parallel_calls=tf.data.experimental.AUTOTUNE
)
dataset = dataset.shard(num_gpus, id)
dataset = dataset.shuffle(records_batch_size * 8)
dataset = dataset.repeat(
count=None
)
dataset = dataset.batch(
batch_size=records_batch_size,
drop_remainder=False
)
dataset = dataset.map(
map_func=partial(
_consolidate_batch
),
num_parallel_calls=tf.data.experimental.AUTOTUNE
)
dataset = dataset.prefetch(
buffer_size=tf.data.experimental.AUTOTUNE
)
return dataset
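# Evaluation input pipeline: files are listed in a fixed order (shuffle=False) and
# read sequentially, records are sharded per GPU, batched and only then parsed with
# parse_example_dataset, so each worker evaluates a deterministic, non-overlapping
# part of the validation set.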
def eval_input_fn(
filepath_pattern,
feature_spec,
records_batch_size,
num_gpus=1,
repeat=1,
id=0):
dataset = tf.data.Dataset.list_files(
file_pattern=filepath_pattern,
shuffle=False
)
dataset = tf.data.TFRecordDataset(
filenames=dataset,
num_parallel_reads=1
)
dataset = dataset.shard(num_gpus, id)
dataset = dataset.repeat(
count=repeat
)
dataset = dataset.batch(
batch_size=records_batch_size,
drop_remainder=False
)
dataset = dataset.apply(
transformation_func=tf.data.experimental.parse_example_dataset(
features=feature_spec,
num_parallel_calls=1
)
)
dataset = dataset.map(
map_func=partial(
_consolidate_batch
),
num_parallel_calls=None
)
dataset = dataset.prefetch(
buffer_size=1
)
return dataset

View file

@@ -0,0 +1,131 @@
# Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import logging
import tensorflow as tf
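# Number of samples packed into a single tf.train.Example by the preprocessing
# pipelines; the dataloader later flattens this prebatch dimension.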
PREBATCH_SIZE = 4096
DISPLAY_ID_COLUMN = 'display_id'
TIME_COLUMNS = [
'doc_event_days_since_published_log_01scaled',
'doc_ad_days_since_published_log_01scaled'
]
GB_COLUMNS = [
'pop_document_id',
'pop_publisher_id',
'pop_source_id',
'pop_ad_id',
'pop_advertiser_id',
'pop_campain_id',
'doc_views_log_01scaled',
'ad_views_log_01scaled'
]
SIM_COLUMNS = [
'doc_event_doc_ad_sim_categories',
'doc_event_doc_ad_sim_topics',
'doc_event_doc_ad_sim_entities'
]
NUMERIC_COLUMNS = TIME_COLUMNS + SIM_COLUMNS + GB_COLUMNS
CATEGORICAL_COLUMNS = [
'ad_id',
'campaign_id',
'doc_event_id',
'event_platform',
'doc_id',
'ad_advertiser',
'doc_event_source_id',
'doc_event_publisher_id',
'doc_ad_source_id',
'doc_ad_publisher_id',
'event_geo_location',
'event_country',
'event_country_state',
]
HASH_BUCKET_SIZES = {
'doc_event_id': 300000,
'ad_id': 250000,
'doc_id': 100000,
'doc_ad_source_id': 4000,
'doc_event_source_id': 4000,
'event_geo_location': 2500,
'ad_advertiser': 2500,
'event_country_state': 2000,
'doc_ad_publisher_id': 1000,
'doc_event_publisher_id': 1000,
'event_country': 300,
'event_platform': 4,
'campaign_id': 5000
}
EMBEDDING_DIMENSIONS = {
'doc_event_id': 128,
'ad_id': 128,
'doc_id': 128,
'doc_ad_source_id': 64,
'doc_event_source_id': 64,
'event_geo_location': 64,
'ad_advertiser': 64,
'event_country_state': 64,
'doc_ad_publisher_id': 64,
'doc_event_publisher_id': 64,
'event_country': 64,
'event_platform': 16,
'campaign_id': 128
}
EMBEDDING_TABLE_SHAPES = {
column: (HASH_BUCKET_SIZES[column], EMBEDDING_DIMENSIONS[column]) for column in CATEGORICAL_COLUMNS
}
def get_features_keys():
return CATEGORICAL_COLUMNS + NUMERIC_COLUMNS + [DISPLAY_ID_COLUMN]
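# Builds the columns for the Wide & Deep model: each categorical feature becomes an
# identity column (wide part) wrapped in an embedding column (deep part), using the
# hash bucket sizes and embedding dimensions defined above, while the numeric
# features are shared by both parts.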
def get_feature_columns():
logger = logging.getLogger('tensorflow')
wide_columns, deep_columns = [], []
for column_name in CATEGORICAL_COLUMNS:
if column_name in EMBEDDING_TABLE_SHAPES:
categorical_column = tf.feature_column.categorical_column_with_identity(
column_name, num_buckets=EMBEDDING_TABLE_SHAPES[column_name][0])
wrapped_column = tf.feature_column.embedding_column(
categorical_column,
dimension=EMBEDDING_TABLE_SHAPES[column_name][1],
combiner='mean')
else:
raise ValueError(f'Unexpected categorical column found {column_name}')
wide_columns.append(categorical_column)
deep_columns.append(wrapped_column)
numerics = [tf.feature_column.numeric_column(column_name, shape=(1,), dtype=tf.float32)
for column_name in NUMERIC_COLUMNS]
wide_columns.extend(numerics)
deep_columns.extend(numerics)
logger.warning('deep columns: {}'.format(len(deep_columns)))
logger.warning('wide columns: {}'.format(len(wide_columns)))
logger.warning('wide&deep intersection: {}'.format(len(set(wide_columns).intersection(set(deep_columns)))))
return wide_columns, deep_columns

View file

@@ -0,0 +1,51 @@
# Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import logging
import os
os.environ['TF_MEMORY_ALLOCATION'] = "0.0"
from data.outbrain.nvtabular.utils.converter import nvt_to_tfrecords
from data.outbrain.nvtabular.utils.workflow import execute_pipeline
from data.outbrain.nvtabular.utils.arguments import parse_args
from data.outbrain.nvtabular.utils.setup import create_config
def is_empty(path):
return not os.path.exists(path) or (not os.path.isfile(path) and not os.listdir(path))
def main():
args = parse_args()
config = create_config(args)
if is_empty(args.metadata_path):
logging.warning('Creating new stats data into {}'.format(config['stats_file']))
execute_pipeline(config)
else:
logging.warning(f'Directory is not empty {args.metadata_path}')
logging.warning('Skipping NVTabular preprocessing')
if os.path.exists(config['output_train_folder']) and os.path.exists(config['output_valid_folder']):
if is_empty(config['tfrecords_path']):
logging.warning('Executing NVTabular parquets to TFRecords conversion')
nvt_to_tfrecords(config)
else:
logging.warning(f"Directory is not empty {config['tfrecords_path']}")
logging.warning('Skipping TFrecords conversion')
else:
logging.warning(f'Train and validation dataset not found in {args.metadata_path}')
if __name__ == '__main__':
main()

View file

@@ -0,0 +1,52 @@
# Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
DEFAULT_DIR = '/outbrain'
def parse_args():
parser = argparse.ArgumentParser()
parser.add_argument(
'--data_path',
help='Path with the data required for NVTabular preprocessing. '
'If stats already exists under metadata_path preprocessing phase will be skipped.',
type=str,
default=f'{DEFAULT_DIR}/orig',
nargs='+'
)
parser.add_argument(
'--metadata_path',
help='Path with preprocessed NVTabular stats',
type=str,
default=f'{DEFAULT_DIR}/data',
nargs='+'
)
parser.add_argument(
'--tfrecords_path',
help='Path where converted tfrecords will be stored',
type=str,
default=f'{DEFAULT_DIR}/tfrecords',
nargs='+'
)
parser.add_argument(
'--workers',
help='Number of TfRecords files to be created',
type=int,
default=40
)
return parser.parse_args()

View file

@@ -0,0 +1,158 @@
# Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import logging
import os
from multiprocessing import Process
import pandas as pd
import tensorflow as tf
from tensorflow_transform.tf_metadata import dataset_metadata
from tensorflow_transform.tf_metadata import dataset_schema
from tensorflow_transform.tf_metadata import metadata_io
from data.outbrain.features import PREBATCH_SIZE
from data.outbrain.nvtabular.utils.feature_description import transform_nvt_to_spark, CATEGORICAL_COLUMNS, \
DISPLAY_ID_COLUMN, EXCLUDE_COLUMNS
def create_metadata(df, prebatch_size, output_path):
fixed_shape = [prebatch_size, 1]
spec = {}
for column in df:
if column in CATEGORICAL_COLUMNS + [DISPLAY_ID_COLUMN]:
spec[transform_nvt_to_spark(column)] = tf.io.FixedLenFeature(shape=fixed_shape, dtype=tf.int64,
default_value=None)
else:
spec[transform_nvt_to_spark(column)] = tf.io.FixedLenFeature(shape=fixed_shape, dtype=tf.float32,
default_value=None)
metadata = dataset_metadata.DatasetMetadata(dataset_schema.from_feature_spec(spec))
metadata_io.write_metadata(metadata, output_path)
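# Packs `offset` consecutive DataFrame rows starting at `start_index` into a single
# prebatched tf.train.Example: int64 lists for the categorical columns and the
# display id, float lists for the remaining numeric columns. Column names are
# translated from the NVTabular naming convention to the Spark one so that both
# preprocessing pipelines emit TFRecords with identical feature names.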
def create_tf_example(df, start_index, offset):
parsed_features = {}
records = df.loc[start_index:start_index + offset - 1]
for column in records:
if column in CATEGORICAL_COLUMNS + [DISPLAY_ID_COLUMN]:
feature = tf.train.Feature(int64_list=tf.train.Int64List(value=records[column].to_numpy()))
else:
feature = tf.train.Feature(float_list=tf.train.FloatList(value=records[column].to_numpy()))
parsed_features[transform_nvt_to_spark(column)] = feature
features = tf.train.Features(feature=parsed_features)
return tf.train.Example(features=features)
def create_tf_records(df, prebatch_size, output_path):
with tf.io.TFRecordWriter(output_path) as out_file:
start_index = df.index[0]
for index in range(start_index, df.shape[0] + start_index - prebatch_size + 1, prebatch_size):
example = create_tf_example(df, index, prebatch_size)
out_file.write(example.SerializeToString())
def convert(path_to_nvt_dataset, output_path, prebatch_size, exclude_columns, workers=6):
train_path = os.path.join(path_to_nvt_dataset, 'train')
valid_path = os.path.join(path_to_nvt_dataset, 'valid')
output_metadata_path = os.path.join(output_path, 'transformed_metadata')
output_train_path = os.path.join(output_path, 'train')
output_valid_path = os.path.join(output_path, 'eval')
for directory in [output_metadata_path, output_train_path, output_valid_path]:
os.makedirs(directory, exist_ok=True)
train_workers, valid_workers = [], []
output_train_paths, output_valid_paths = [], []
for worker in range(workers):
part_number = str(worker).rjust(5, '0')
record_train_path = os.path.join(output_train_path, f'part-r-{part_number}')
record_valid_path = os.path.join(output_valid_path, f'part-r-{part_number}')
output_train_paths.append(record_train_path)
output_valid_paths.append(record_valid_path)
logging.warning(f'Prebatch size set to {prebatch_size}')
logging.warning(f'Number of TFRecords set to {workers}')
logging.warning(f'Reading training parquets from {train_path}')
df_train = pd.read_parquet(train_path, engine='pyarrow')
logging.warning('Done')
logging.warning(f'Removing training columns {exclude_columns}')
df_train = df_train.drop(columns=exclude_columns)
logging.warning('Done')
logging.warning(f'Creating metadata in {output_metadata_path}')
metadata_worker = Process(target=create_metadata, args=(df_train, prebatch_size, output_metadata_path))
metadata_worker.start()
logging.warning(f'Creating training TFrecords to {output_train_paths}')
shape = df_train.shape[0] // workers
shape = shape + (prebatch_size - shape % prebatch_size)
for worker_index in range(workers):
df_subset = df_train.loc[worker_index * shape:(worker_index + 1) * shape - 1]
worker = Process(target=create_tf_records, args=(df_subset, prebatch_size, output_train_paths[worker_index]))
train_workers.append(worker)
for worker in train_workers:
worker.start()
logging.warning(f'Reading validation parquets from {valid_path}')
df_valid = pd.read_parquet(valid_path, engine='pyarrow')
logging.warning('Done')
logging.warning(f'Removing validation columns {exclude_columns}')
df_valid = df_valid.drop(columns=exclude_columns)
logging.warning('Done')
logging.warning(f'Creating validation TFrecords to {output_valid_paths}')
shape = df_valid.shape[0] // workers
shape = shape + (prebatch_size - shape % prebatch_size)
for worker_index in range(workers):
df_subset = df_valid.loc[worker_index * shape:(worker_index + 1) * shape - 1]
worker = Process(target=create_tf_records, args=(df_subset, prebatch_size, output_valid_paths[worker_index]))
valid_workers.append(worker)
for worker in valid_workers:
worker.start()
for worker_index in range(workers):
metadata_worker.join()
train_workers[worker_index].join()
valid_workers[worker_index].join()
logging.warning('Done')
del df_train
del df_valid
return output_path
def nvt_to_tfrecords(config):
path_to_nvt_dataset = config['output_bucket_folder']
output_path = config['tfrecords_path']
workers = config['workers']
convert(
path_to_nvt_dataset=path_to_nvt_dataset,
output_path=output_path,
prebatch_size=PREBATCH_SIZE,
exclude_columns=EXCLUDE_COLUMNS,
workers=workers
)

View file

@@ -0,0 +1,108 @@
# Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
DISPLAY_ID_COLUMN = 'display_id'
BASE_CONT_COLUMNS = ['publish_time', 'publish_time_promo', 'timestamp', 'document_id_promo_clicked_sum_ctr',
'publisher_id_promo_clicked_sum_ctr',
'source_id_promo_clicked_sum_ctr', 'document_id_promo_count', 'publish_time_days_since_published',
'ad_id_clicked_sum_ctr',
'advertiser_id_clicked_sum_ctr', 'campaign_id_clicked_sum_ctr', 'ad_id_count',
'publish_time_promo_days_since_published']
SIM_COLUMNS = [
'doc_event_doc_ad_sim_categories',
'doc_event_doc_ad_sim_topics',
'doc_event_doc_ad_sim_entities'
]
CONTINUOUS_COLUMNS = BASE_CONT_COLUMNS + SIM_COLUMNS + [DISPLAY_ID_COLUMN]
groupby_columns = ['ad_id_count', 'ad_id_clicked_sum', 'source_id_promo_count', 'source_id_promo_clicked_sum',
'document_id_promo_count', 'document_id_promo_clicked_sum',
'publisher_id_promo_count', 'publisher_id_promo_clicked_sum', 'advertiser_id_count',
'advertiser_id_clicked_sum',
'campaign_id_count', 'campaign_id_clicked_sum']
ctr_columns = ['advertiser_id_clicked_sum_ctr', 'document_id_promo_clicked_sum_ctr',
'publisher_id_promo_clicked_sum_ctr',
'source_id_promo_clicked_sum_ctr',
'ad_id_clicked_sum_ctr', 'campaign_id_clicked_sum_ctr']
exclude_conts = ['publish_time', 'publish_time_promo', 'timestamp']
NUMERIC_COLUMNS = [col for col in CONTINUOUS_COLUMNS if col not in exclude_conts]
CATEGORICAL_COLUMNS = ['ad_id', 'document_id', 'platform', 'document_id_promo', 'campaign_id', 'advertiser_id',
'source_id',
'publisher_id', 'source_id_promo', 'publisher_id_promo', 'geo_location', 'geo_location_country',
'geo_location_state']
EXCLUDE_COLUMNS = [
'publish_time',
'publish_time_promo',
'timestamp',
'ad_id_clicked_sum',
'source_id_promo_count',
'source_id_promo_clicked_sum',
'document_id_promo_clicked_sum',
'publisher_id_promo_count', 'publisher_id_promo_clicked_sum',
'advertiser_id_count',
'advertiser_id_clicked_sum',
'campaign_id_count',
'campaign_id_clicked_sum',
'uuid',
'day_event'
]
nvt_to_spark = {
'ad_id': 'ad_id',
'clicked': 'label',
'display_id': 'display_id',
'document_id': 'doc_event_id',
'platform': 'event_platform',
'document_id_promo': 'doc_id',
'campaign_id': 'campaign_id',
'advertiser_id': 'ad_advertiser',
'source_id': 'doc_event_source_id',
'publisher_id': 'doc_event_publisher_id',
'source_id_promo': 'doc_ad_source_id',
'publisher_id_promo': 'doc_ad_publisher_id',
'geo_location': 'event_geo_location',
'geo_location_country': 'event_country',
'geo_location_state': 'event_country_state',
'document_id_promo_clicked_sum_ctr': 'pop_document_id',
'publisher_id_promo_clicked_sum_ctr': 'pop_publisher_id',
'source_id_promo_clicked_sum_ctr': 'pop_source_id',
'document_id_promo_count': 'doc_views_log_01scaled',
'publish_time_days_since_published': 'doc_event_days_since_published_log_01scaled',
'ad_id_clicked_sum_ctr': 'pop_ad_id',
'advertiser_id_clicked_sum_ctr': 'pop_advertiser_id',
'campaign_id_clicked_sum_ctr': 'pop_campain_id',
'ad_id_count': 'ad_views_log_01scaled',
'publish_time_promo_days_since_published': 'doc_ad_days_since_published_log_01scaled',
'doc_event_doc_ad_sim_categories': 'doc_event_doc_ad_sim_categories',
'doc_event_doc_ad_sim_topics': 'doc_event_doc_ad_sim_topics',
'doc_event_doc_ad_sim_entities': 'doc_event_doc_ad_sim_entities'
}
spark_to_nvt = {item: key for key, item in nvt_to_spark.items()}
def transform_nvt_to_spark(column):
return nvt_to_spark[column]
def transform_spark_to_nvt(column):
return spark_to_nvt[column]

View file

@@ -0,0 +1,48 @@
# Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
from data.outbrain.features import HASH_BUCKET_SIZES
from data.outbrain.nvtabular.utils.feature_description import transform_spark_to_nvt
def create_config(args):
stats_file = os.path.join(args.metadata_path, 'stats_wnd_workflow')
data_bucket_folder = args.data_path
output_bucket_folder = args.metadata_path
output_train_folder = os.path.join(output_bucket_folder, 'train/')
temporary_folder = os.path.join('/tmp', 'preprocessed')
train_path = os.path.join(temporary_folder, 'train_gdf.parquet')
valid_path = os.path.join(temporary_folder, 'valid_gdf.parquet')
output_valid_folder = os.path.join(output_bucket_folder, 'valid/')
tfrecords_path = args.tfrecords_path
workers = args.workers
hash_spec = {transform_spark_to_nvt(column): hash for column, hash in HASH_BUCKET_SIZES.items()}
config = {
'stats_file': stats_file,
'data_bucket_folder': data_bucket_folder,
'output_bucket_folder': output_bucket_folder,
'output_train_folder': output_train_folder,
'temporary_folder': temporary_folder,
'train_path': train_path,
'valid_path': valid_path,
'output_valid_folder': output_valid_folder,
'tfrecords_path': tfrecords_path,
'workers': workers,
'hash_spec': hash_spec
}
return config

View file

@@ -0,0 +1,254 @@
# Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import shutil
import cudf
import cupy
import nvtabular as nvt
import rmm
from dask.distributed import Client
from dask_cuda import LocalCUDACluster
from data.outbrain.nvtabular.utils.feature_description import CATEGORICAL_COLUMNS, CONTINUOUS_COLUMNS, \
DISPLAY_ID_COLUMN, groupby_columns, ctr_columns
from nvtabular.io import Shuffle
from nvtabular.ops import Normalize, FillMedian, FillMissing, LogOp, LambdaOp, JoinGroupby, HashBucket
from nvtabular.ops.column_similarity import ColumnSimilarity
from nvtabular.utils import device_mem_size, get_rmm_size
TIMESTAMP_DELTA = 1465876799998
def get_devices():
try:
devices = [int(device) for device in os.environ["CUDA_VISIBLE_DEVICES"].split(",")]
except KeyError:
from pynvml import nvmlInit, nvmlDeviceGetCount
nvmlInit()
devices = list(range(nvmlDeviceGetCount()))
return devices
def _calculate_delta(col, gdf):
col.loc[col == ''] = None
col = col.astype('datetime64[ns]')
timestamp = (gdf['timestamp'] + TIMESTAMP_DELTA).astype('datetime64[ms]')
delta = (timestamp - col).dt.days
delta = delta * (delta >= 0) * (delta <= 10 * 365)
return delta
def _df_to_coo(df, row='document_id', col=None, data='confidence_level'):
return cupy.sparse.coo_matrix((df[data].values, (df[row].values, df[col].values)))
def setup_rmm_pool(client, pool_size):
pool_size = get_rmm_size(pool_size)
client.run(rmm.reinitialize, pool_allocator=True, initial_pool_size=pool_size)
return None
def create_client(devices, local_directory):
client = None
if len(devices) > 1:
device_size = device_mem_size(kind="total")
device_limit = int(0.8 * device_size)
device_pool_size = int(0.8 * device_size)
cluster = LocalCUDACluster(
n_workers=len(devices),
CUDA_VISIBLE_DEVICES=",".join(str(x) for x in devices),
device_memory_limit=device_limit,
local_directory=local_directory
)
client = Client(cluster)
setup_rmm_pool(client, device_pool_size)
return client
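# Defines the NVTabular feature-engineering workflow: slice country and state codes
# out of the geo location, compute days-since-published deltas, derive per-id CTR
# features from group-by click sums and counts (zeroed below a per-column click
# threshold), fill missing values, log-transform and normalize the numeric features,
# compute TF-IDF similarities between the event document and the promoted document
# over categories, topics and entities, and finally hash-bucket the categoricals.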
def create_workflow(data_bucket_folder, output_bucket_folder, hash_spec, devices, local_directory):
rmm.reinitialize(managed_memory=False)
documents_categories_path = os.path.join(data_bucket_folder, 'documents_categories.csv')
documents_topics_path = os.path.join(data_bucket_folder, 'documents_topics.csv')
documents_entities_path = os.path.join(data_bucket_folder, 'documents_entities.csv')
documents_categories_cudf = cudf.read_csv(documents_categories_path)
documents_topics_cudf = cudf.read_csv(documents_topics_path)
documents_entities_cudf = cudf.read_csv(documents_entities_path)
documents_entities_cudf['entity_id'] = documents_entities_cudf['entity_id'].astype('category').cat.codes
categories = _df_to_coo(documents_categories_cudf, col='category_id')
topics = _df_to_coo(documents_topics_cudf, col='topic_id')
entities = _df_to_coo(documents_entities_cudf, col='entity_id')
del documents_categories_cudf, documents_topics_cudf, documents_entities_cudf
ctr_thresh = {
'ad_id': 5,
'source_id_promo': 10,
'publisher_id_promo': 10,
'advertiser_id': 10,
'campaign_id': 10,
'document_id_promo': 5,
}
client = create_client(
devices=devices,
local_directory=local_directory
)
workflow = nvt.Workflow(
cat_names=CATEGORICAL_COLUMNS,
cont_names=CONTINUOUS_COLUMNS,
label_name=['clicked'],
client=client
)
workflow.add_feature([
LambdaOp(
op_name='country',
f=lambda col, gdf: col.str.slice(0, 2),
columns=['geo_location'], replace=False),
LambdaOp(
op_name='state',
f=lambda col, gdf: col.str.slice(0, 5),
columns=['geo_location'], replace=False),
LambdaOp(
op_name='days_since_published',
f=_calculate_delta,
columns=['publish_time', 'publish_time_promo'], replace=False),
FillMedian(columns=['publish_time_days_since_published', 'publish_time_promo_days_since_published']),
JoinGroupby(columns=['ad_id', 'source_id_promo', 'document_id_promo', 'publisher_id_promo', 'advertiser_id',
'campaign_id'],
cont_names=['clicked'], out_path=output_bucket_folder, stats=['sum', 'count']),
LambdaOp(
op_name='ctr',
f=lambda col, gdf: ((col) / (gdf[col.name.replace('_clicked_sum', '_count')])).where(
gdf[col.name.replace('_clicked_sum', '_count')] >= ctr_thresh[col.name.replace('_clicked_sum', '')], 0),
columns=['ad_id_clicked_sum', 'source_id_promo_clicked_sum', 'document_id_promo_clicked_sum',
'publisher_id_promo_clicked_sum',
'advertiser_id_clicked_sum', 'campaign_id_clicked_sum'], replace=False),
FillMissing(columns=groupby_columns + ctr_columns),
LogOp(
columns=groupby_columns + ['publish_time_days_since_published', 'publish_time_promo_days_since_published']),
Normalize(columns=groupby_columns),
ColumnSimilarity('doc_event_doc_ad_sim_categories', 'document_id', categories, 'document_id_promo',
metric='tfidf', on_device=False),
ColumnSimilarity('doc_event_doc_ad_sim_topics', 'document_id', topics, 'document_id_promo', metric='tfidf',
on_device=False),
ColumnSimilarity('doc_event_doc_ad_sim_entities', 'document_id', entities, 'document_id_promo', metric='tfidf',
on_device=False)
])
workflow.add_cat_preprocess([
HashBucket(hash_spec)
])
workflow.finalize()
return workflow
def create_parquets(data_bucket_folder, train_path, valid_path):
cupy.random.seed(seed=0)
rmm.reinitialize(managed_memory=True)
documents_meta_path = os.path.join(data_bucket_folder, 'documents_meta.csv')
clicks_train_path = os.path.join(data_bucket_folder, 'clicks_train.csv')
events_path = os.path.join(data_bucket_folder, 'events.csv')
promoted_content_path = os.path.join(data_bucket_folder, 'promoted_content.csv')
documents_meta = cudf.read_csv(documents_meta_path, na_values=['\\N', ''])
documents_meta = documents_meta.dropna(subset='source_id')
documents_meta['publisher_id'].fillna(
documents_meta['publisher_id'].isnull().cumsum() + documents_meta['publisher_id'].max() + 1, inplace=True)
merged = (cudf.read_csv(clicks_train_path, na_values=['\\N', ''])
.merge(cudf.read_csv(events_path, na_values=['\\N', '']), on=DISPLAY_ID_COLUMN, how='left',
suffixes=('', '_event'))
.merge(cudf.read_csv(promoted_content_path, na_values=['\\N', '']), on='ad_id',
how='left',
suffixes=('', '_promo'))
.merge(documents_meta, on='document_id', how='left')
.merge(documents_meta, left_on='document_id_promo', right_on='document_id', how='left',
suffixes=('', '_promo')))
merged['day_event'] = (merged['timestamp'] / 1000 / 60 / 60 / 24).astype(int)
merged['platform'] = merged['platform'].fillna(1)
merged['platform'] = merged['platform'] - 1
display_event = merged[[DISPLAY_ID_COLUMN, 'day_event']].drop_duplicates().reset_index()
random_state = cudf.Series(cupy.random.uniform(size=len(display_event)))
valid_ids, train_ids = display_event.scatter_by_map(
((display_event.day_event <= 10) & (random_state > 0.2)).astype(int))
valid_ids = valid_ids[DISPLAY_ID_COLUMN].drop_duplicates()
train_ids = train_ids[DISPLAY_ID_COLUMN].drop_duplicates()
valid_set = merged[merged[DISPLAY_ID_COLUMN].isin(valid_ids)]
train_set = merged[merged[DISPLAY_ID_COLUMN].isin(train_ids)]
valid_set = valid_set.sort_values(DISPLAY_ID_COLUMN)
train_set.to_parquet(train_path, compression=None)
valid_set.to_parquet(valid_path, compression=None)
del merged, train_set, valid_set
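# Applies the workflow to the training split with record_stats=True so that the
# preprocessing statistics are computed on training data only, reuses them to
# transform the validation split, and saves them to the stats file.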
def save_stats(data_bucket_folder, output_bucket_folder,
output_train_folder, train_path, output_valid_folder,
valid_path, stats_file, hash_spec, local_directory):
devices = get_devices()
shuffle = Shuffle.PER_PARTITION if len(devices) > 1 else True
workflow = create_workflow(data_bucket_folder=data_bucket_folder,
output_bucket_folder=output_bucket_folder,
hash_spec=hash_spec,
devices=devices,
local_directory=local_directory)
train_dataset = nvt.Dataset(train_path, part_mem_fraction=0.12)
valid_dataset = nvt.Dataset(valid_path, part_mem_fraction=0.12)
workflow.apply(train_dataset, record_stats=True, output_path=output_train_folder, shuffle=shuffle,
out_files_per_proc=5)
workflow.apply(valid_dataset, record_stats=False, output_path=output_valid_folder, shuffle=None,
out_files_per_proc=None)
workflow.save_stats(stats_file)
return workflow
def clean(path):
shutil.rmtree(path)
def execute_pipeline(config):
required_folders = [config['temporary_folder'], config['output_train_folder'], config['output_valid_folder']]
for folder in required_folders:
os.makedirs(folder, exist_ok=True)
create_parquets(
data_bucket_folder=config['data_bucket_folder'],
train_path=config['train_path'],
valid_path=config['valid_path']
)
save_stats(
data_bucket_folder=config['data_bucket_folder'],
output_bucket_folder=config['output_bucket_folder'],
output_train_folder=config['output_train_folder'],
train_path=config['train_path'],
output_valid_folder=config['output_valid_folder'],
valid_path=config['valid_path'],
stats_file=config['stats_file'],
hash_spec=config['hash_spec'],
local_directory=config['temporary_folder']
)
clean(config['temporary_folder'])

View file

@@ -0,0 +1,13 @@
state_abb,utc_dst_time_offset_cleaned
AB,-6.0
BC,-7.0
MB,-5.0
NB,-3.0
NL,-3.0
NS,-3.0
NU,-5.0
ON,-4.0
PE,-3.0
QC,-4.0
SK,-6.0
YT,-7.0

View file

@@ -0,0 +1,247 @@
country_code,utc_dst_time_offset_cleaned
AX,3.0
AF,4.5
AL,2.0
DZ,1.0
AD,2.0
AO,1.0
AI,-4.0
AG,-4.0
AR,-3.0
AM,4.0
AW,-4.0
AU,10.0
AT,2.0
AZ,4.0
BS,-4.0
BH,3.0
BD,6.0
BB,-4.0
BY,3.0
BE,2.0
BZ,-6.0
BJ,1.0
BM,-3.0
BT,6.0
BO,-4.0
BA,2.0
BW,2.0
BR,-3.0
IO,6.0
BN,8.0
BG,3.0
BF,0.0
BI,2.0
KH,7.0
CM,1.0
CA,-5.0
BQ,-5.0
KY,-5.0
CF,1.0
TD,1.0
CL,-3.0
CN,8.0
CX,7.0
CC,6.5
CO,-5.0
KM,3.0
CD,1.0
CG,1.0
CK,-10.0
CR,-6.0
CI,0.0
HR,2.0
CW,-4.0
CY,3.0
CZ,2.0
DK,2.0
DJ,3.0
DM,-4.0
DO,-4.0
TL,9.0
EC,-5.0
EG,2.0
SV,-6.0
GQ,1.0
ER,3.0
EE,3.0
ET,3.0
FK,-3.0
FO,1.0
FJ,12.0
FI,3.0
FR,2.0
GF,-3.0
PF,-10.0
GA,1.0
GM,0.0
GE,4.0
DE,2.0
GH,0.0
GI,2.0
GR,3.0
GL,-2.0
GD,-4.0
GP,-4.0
GU,10.0
GT,-6.0
GG,1.0
GN,0.0
GW,0.0
GY,-4.0
HT,-5.0
HN,-6.0
HK,8.0
HU,2.0
IS,0.0
IN,5.5
ID,8.0
IR,4.5
IQ,3.0
IE,1.0
IM,1.0
IL,3.0
IT,2.0
JM,-5.0
JP,9.0
JE,1.0
JO,3.0
KZ,5.0
KE,3.0
KI,13.0
KP,-4.0
KR,-4.0
KP,8.5
KR,8.5
KP,9.0
KR,9.0
KW,3.0
KG,6.0
LA,7.0
LV,3.0
LB,3.0
LS,2.0
LR,0.0
LY,2.0
LI,2.0
LT,3.0
LU,2.0
MO,8.0
MK,2.0
MG,3.0
MW,2.0
MY,8.0
MV,5.0
ML,0.0
MT,2.0
MH,12.0
MQ,-4.0
MR,0.0
MU,4.0
YT,3.0
MX,-5.0
FM,10.0
MD,3.0
MC,2.0
MN,9.0
ME,2.0
MS,-4.0
MA,1.0
MZ,2.0
MM,6.5
NA,1.0
NR,12.0
NP,5.0
NL,2.0
NC,11.0
NZ,12.0
NI,-6.0
NE,1.0
NG,1.0
NU,-11.0
NF,11.0
MP,10.0
NO,2.0
OM,4.0
PK,5.0
PW,9.0
PS,3.0
PA,-5.0
PG,10.0
PY,-4.0
PE,-5.0
PH,8.0
PN,-8.0
PL,2.0
PT,1.0
PR,-4.0
QA,3.0
RE,4.0
RO,3.0
RU,7.0
RW,2.0
BL,-4.0
AS,-11.0
WS,-11.0
AS,13.0
WS,13.0
SM,2.0
ST,0.0
SA,3.0
SN,0.0
RS,2.0
SC,4.0
SL,0.0
SG,8.0
SK,2.0
SI,2.0
SB,11.0
SO,3.0
ZA,2.0
GS,-2.0
SS,3.0
ES,2.0
LK,5.5
SH,0.0
KN,-4.0
SX,-4.0
MF,-4.0
SD,3.0
SR,-3.0
SJ,2.0
SZ,2.0
SE,2.0
CH,2.0
SY,3.0
TW,8.0
TJ,5.0
TZ,3.0
TH,7.0
TG,0.0
TK,13.0
TO,13.0
TT,-4.0
TN,1.0
TR,3.0
TM,5.0
TC,-4.0
TV,12.0
UG,3.0
UA,3.0
AE,4.0
GB,1.0
US,-7.0
UY,-3.0
UZ,5.0
VU,11.0
VA,2.0
VE,-4.0
VN,7.0
VG,-4.0
VI,-4.0
VG,-4.0
VI,-4.0
WF,12.0
YE,3.0
ZM,2.0
ZW,2.0

View file

@@ -0,0 +1,52 @@
state_abb,utc_dst_time_offset_cleaned
AL,-5.0
AK,-8.0
AZ,-7.0
AR,-5.0
CA,-7.0
CO,-6.0
CT,-4.0
DE,-4.0
DC,-4.0
FL,-4.0
GA,-4.0
HI,-10.0
ID,-6.0
IL,-5.0
IN,-4.0
IA,-5.0
KS,-5.0
KY,-4.0
LA,-5.0
ME,-4.0
MD,-4.0
MA,-4.0
MI,-4.0
MN,-5.0
MS,-5.0
MO,-5.0
MT,-6.0
NE,-5.0
NV,-7.0
NH,-4.0
NJ,-4.0
NM,-6.0
NY,-4.0
NC,-4.0
ND,-5.0
OH,-4.0
OK,-5.0
OR,-7.0
PA,-4.0
RI,-4.0
SC,-4.0
SD,-5.0
TN,-5.0
TX,-5.0
UT,-6.0
VT,-4.0
VA,-4.0
WA,-7.0
WV,-4.0
WI,-5.0
WY,-6.0

View file

@@ -0,0 +1,104 @@
#!/usr/bin/env python
# coding: utf-8
# Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from pyspark.context import SparkContext, SparkConf
from pyspark.sql.functions import col
from pyspark.sql.session import SparkSession
from pyspark.sql.types import IntegerType, StringType, StructType, StructField
OUTPUT_BUCKET_FOLDER = "/tmp/spark/preprocessed/"
DATA_BUCKET_FOLDER = "/outbrain/orig/"
SPARK_TEMP_FOLDER = "/tmp/spark/spark-temp/"
conf = SparkConf().setMaster('local[*]').set('spark.executor.memory', '40g').set('spark.driver.memory', '200g').set(
"spark.local.dir", SPARK_TEMP_FOLDER)
sc = SparkContext(conf=conf)
spark = SparkSession(sc)
print('Loading data...')
events_schema = StructType(
[StructField("display_id", IntegerType(), True),
StructField("uuid_event", StringType(), True),
StructField("document_id_event", IntegerType(), True),
StructField("timestamp_event", IntegerType(), True),
StructField("platform_event", IntegerType(), True),
StructField("geo_location_event", StringType(), True)]
)
events_df = spark.read.schema(events_schema) \
.options(header='true', inferschema='false', nullValue='\\N') \
.csv(DATA_BUCKET_FOLDER + "events.csv") \
.withColumn('day_event', (col('timestamp_event') / 1000 / 60 / 60 / 24).cast("int")) \
.alias('events')
events_df.count()
print('Drop rows with empty "geo_location"...')
events_df = events_df.dropna(subset="geo_location_event")
events_df.count()
print('Drop rows with empty "platform"...')
events_df = events_df.dropna(subset="platform_event")
events_df.count()
promoted_content_schema = StructType(
[StructField("ad_id", IntegerType(), True),
StructField("document_id_promo", IntegerType(), True),
StructField("campaign_id", IntegerType(), True),
StructField("advertiser_id", IntegerType(), True)]
)
promoted_content_df = spark.read.schema(promoted_content_schema) \
.options(header='true', inferschema='false', nullValue='\\N') \
.csv(DATA_BUCKET_FOLDER + "promoted_content.csv") \
.alias('promoted_content')
clicks_train_schema = StructType(
[StructField("display_id", IntegerType(), True),
StructField("ad_id", IntegerType(), True),
StructField("clicked", IntegerType(), True)]
)
clicks_train_df = spark.read.schema(clicks_train_schema) \
.options(header='true', inferschema='false', nullValue='\\N') \
.csv(DATA_BUCKET_FOLDER + "clicks_train.csv") \
.alias('clicks_train')
clicks_train_joined_df = clicks_train_df \
.join(promoted_content_df, on='ad_id', how='left') \
.join(events_df, on='display_id', how='left')
clicks_train_joined_df.createOrReplaceTempView('clicks_train_joined')
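# Validation split: sample 20% of display_ids from days 0-10 and keep all
# display_ids from the last two days (11-12); the selected ids are written out
# below as the validation set parquet.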
validation_display_ids_df = clicks_train_joined_df.select('display_id', 'day_event') \
.distinct() \
.sampleBy("day_event", fractions={0: 0.2, 1: 0.2, 2: 0.2, 3: 0.2, 4: 0.2,
5: 0.2, 6: 0.2, 7: 0.2, 8: 0.2, 9: 0.2, 10: 0.2, 11: 1.0, 12: 1.0}, seed=0)
validation_display_ids_df.createOrReplaceTempView("validation_display_ids")
validation_set_df = spark.sql('''SELECT display_id, ad_id, uuid_event, day_event,
timestamp_event, document_id_promo, platform_event, geo_location_event
FROM clicks_train_joined t
WHERE EXISTS (SELECT display_id FROM validation_display_ids
WHERE display_id = t.display_id)''')
validation_set_gcs_output = "validation_set.parquet"
validation_set_df.write.parquet(OUTPUT_BUCKET_FOLDER + validation_set_gcs_output, mode='overwrite')
print(validation_set_df.take(5))
spark.stop()

File diff suppressed because it is too large

View file

@@ -0,0 +1,474 @@
#!/usr/bin/env python
# coding: utf-8
# Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import datetime
import numpy as np
import pandas as pd
import pyspark.sql.functions as F
import tensorflow as tf
from pyspark import TaskContext
from pyspark.context import SparkContext, SparkConf
from pyspark.sql.functions import col, udf
from pyspark.sql.session import SparkSession
from pyspark.sql.types import ArrayType, DoubleType
from tensorflow_transform.tf_metadata import dataset_metadata
from tensorflow_transform.tf_metadata import dataset_schema
from tensorflow_transform.tf_metadata import metadata_io
from data.outbrain.features import PREBATCH_SIZE, HASH_BUCKET_SIZES
from data.outbrain.spark.utils.feature_description import LABEL_COLUMN, DISPLAY_ID_COLUMN, CATEGORICAL_COLUMNS, \
DOC_CATEGORICAL_MULTIVALUED_COLUMNS, BOOL_COLUMNS, INT_COLUMNS, FLOAT_COLUMNS, \
FLOAT_COLUMNS_LOG_BIN_TRANSFORM, FLOAT_COLUMNS_SIMPLE_BIN_TRANSFORM, FLOAT_COLUMNS_NO_TRANSFORM
pd.set_option('display.max_columns', 1000)
evaluation = True
evaluation_verbose = False
OUTPUT_BUCKET_FOLDER = "/tmp/spark/preprocessed/"
DATA_BUCKET_FOLDER = "/data/orig/"
SPARK_TEMP_FOLDER = "/tmp/spark/spark-temp/"
LOCAL_DATA_TFRECORDS_DIR = "/outbrain/tfrecords"
TEST_SET_MODE = False
TENSORFLOW_HADOOP = "data/outbrain/spark/data/tensorflow-hadoop-1.5.0.jar"
conf = SparkConf().setMaster('local[*]').set('spark.executor.memory', '40g').set('spark.driver.memory', '200g').set(
"spark.local.dir", SPARK_TEMP_FOLDER)
conf.set("spark.jars", TENSORFLOW_HADOOP)
conf.set("spark.sql.files.maxPartitionBytes", 805306368)
sc = SparkContext(conf=conf)
spark = SparkSession(sc)
parser = argparse.ArgumentParser()
parser.add_argument(
'--num_train_partitions',
help='number of train partitions',
type=int,
default=40)
parser.add_argument(
'--num_valid_partitions',
help='number of validation partitions',
type=int,
default=40)
args = parser.parse_args()
num_train_partitions = args.num_train_partitions
num_valid_partitions = args.num_valid_partitions
batch_size = PREBATCH_SIZE
# # Feature Vector export
bool_feature_names = []
int_feature_names = ['ad_views',
'doc_views',
'doc_event_days_since_published',
'doc_ad_days_since_published',
]
float_feature_names = [
'pop_ad_id',
'pop_document_id',
'pop_publisher_id',
'pop_advertiser_id',
'pop_campain_id',
'pop_source_id',
'doc_event_doc_ad_sim_categories',
'doc_event_doc_ad_sim_topics',
'doc_event_doc_ad_sim_entities',
]
TRAFFIC_SOURCE_FV = 'traffic_source'
EVENT_HOUR_FV = 'event_hour'
EVENT_COUNTRY_FV = 'event_country'
EVENT_COUNTRY_STATE_FV = 'event_country_state'
EVENT_GEO_LOCATION_FV = 'event_geo_location'
EVENT_PLATFORM_FV = 'event_platform'
AD_ADVERTISER_FV = 'ad_advertiser'
DOC_AD_SOURCE_ID_FV = 'doc_ad_source_id'
DOC_AD_PUBLISHER_ID_FV = 'doc_ad_publisher_id'
DOC_EVENT_SOURCE_ID_FV = 'doc_event_source_id'
DOC_EVENT_PUBLISHER_ID_FV = 'doc_event_publisher_id'
DOC_AD_CATEGORY_ID_FV = 'doc_ad_category_id'
DOC_AD_TOPIC_ID_FV = 'doc_ad_topic_id'
DOC_AD_ENTITY_ID_FV = 'doc_ad_entity_id'
DOC_EVENT_CATEGORY_ID_FV = 'doc_event_category_id'
DOC_EVENT_TOPIC_ID_FV = 'doc_event_topic_id'
DOC_EVENT_ENTITY_ID_FV = 'doc_event_entity_id'
# ### Configuring feature vector
category_feature_names_integral = ['ad_advertiser',
'doc_ad_publisher_id',
'doc_ad_source_id',
'doc_event_publisher_id',
'doc_event_source_id',
'event_country',
'event_country_state',
'event_geo_location',
'event_hour',
'event_platform',
'traffic_source']
feature_vector_labels_integral = bool_feature_names \
+ int_feature_names \
+ float_feature_names \
+ category_feature_names_integral
train_feature_vector_gcs_folder_name = 'train_feature_vectors_integral_eval'
# ## Exporting integral feature vectors to CSV
train_feature_vectors_exported_df = spark.read.parquet(OUTPUT_BUCKET_FOLDER + train_feature_vector_gcs_folder_name)
train_feature_vectors_exported_df.take(3)
integral_headers = ['label', 'display_id', 'ad_id', 'doc_id', 'doc_event_id'] + feature_vector_labels_integral
CSV_ORDERED_COLUMNS = ['label', 'display_id', 'ad_id', 'doc_id', 'doc_event_id', 'ad_views', 'campaign_id','doc_views',
'doc_event_days_since_published', 'doc_ad_days_since_published',
'pop_ad_id', 'pop_document_id', 'pop_publisher_id', 'pop_advertiser_id', 'pop_campain_id',
'pop_source_id',
'doc_event_doc_ad_sim_categories', 'doc_event_doc_ad_sim_topics',
'doc_event_doc_ad_sim_entities', 'ad_advertiser', 'doc_ad_publisher_id',
'doc_ad_source_id', 'doc_event_publisher_id', 'doc_event_source_id', 'event_country',
'event_country_state', 'event_geo_location', 'event_platform',
'traffic_source']
FEAT_CSV_ORDERED_COLUMNS = ['ad_views', 'campaign_id','doc_views',
'doc_event_days_since_published', 'doc_ad_days_since_published',
'pop_ad_id', 'pop_document_id', 'pop_publisher_id', 'pop_advertiser_id', 'pop_campain_id',
'pop_source_id',
'doc_event_doc_ad_sim_categories', 'doc_event_doc_ad_sim_topics',
'doc_event_doc_ad_sim_entities', 'ad_advertiser', 'doc_ad_publisher_id',
'doc_ad_source_id', 'doc_event_publisher_id', 'doc_event_source_id', 'event_country',
'event_country_state', 'event_geo_location', 'event_platform',
'traffic_source']
def to_array(col):
def to_array_(v):
return v.toArray().tolist()
# Important: asNondeterministic requires Spark 2.3 or later
# It can be safely removed i.e.
# return udf(to_array_, ArrayType(DoubleType()))(col)
# but at the cost of decreased performance
return udf(to_array_, ArrayType(DoubleType())).asNondeterministic()(col)
CONVERT_TO_INT = ['doc_ad_category_id_1',
'doc_ad_category_id_2', 'doc_ad_category_id_3', 'doc_ad_topic_id_1', 'doc_ad_topic_id_2',
'doc_ad_topic_id_3', 'doc_ad_entity_id_1', 'doc_ad_entity_id_2', 'doc_ad_entity_id_3',
'doc_ad_entity_id_4', 'doc_ad_entity_id_5', 'doc_ad_entity_id_6',
'doc_ad_source_id', 'doc_event_category_id_1', 'doc_event_category_id_2', 'doc_event_category_id_3',
'doc_event_topic_id_1', 'doc_event_topic_id_2', 'doc_event_topic_id_3', 'doc_event_entity_id_1',
'doc_event_entity_id_2', 'doc_event_entity_id_3', 'doc_event_entity_id_4', 'doc_event_entity_id_5',
'doc_event_entity_id_6']
def format_number(element, name):
if name in BOOL_COLUMNS + CATEGORICAL_COLUMNS:
return element.cast("int")
elif name in CONVERT_TO_INT:
return element.cast("int")
else:
return element
def to_array_with_none(col):
def to_array_with_none_(v):
tmp = np.full((v.size,), fill_value=None, dtype=np.float64)
tmp[v.indices] = v.values
return tmp.tolist()
# Important: asNondeterministic requires Spark 2.3 or later
# It can be safely removed i.e.
# return udf(to_array_, ArrayType(DoubleType()))(col)
# but at the cost of decreased performance
return udf(to_array_with_none_, ArrayType(DoubleType())).asNondeterministic()(col)
@udf
def count_value(x):
from collections import Counter
tmp = Counter(x).most_common(2)
if not tmp or np.isnan(tmp[0][0]):
return 0
return float(tmp[0][0])
def replace_with_most_frequent(most_value):
return udf(lambda x: most_value if not x or np.isnan(x) else x)
train_feature_vectors_integral_csv_rdd_df = train_feature_vectors_exported_df.select('label', 'display_id', 'ad_id',
'document_id', 'document_id_event',
'feature_vector').withColumn(
"featvec", to_array("feature_vector")).select(
['label'] + ['display_id'] + ['ad_id'] + ['document_id'] + ['document_id_event'] + [
format_number(element, FEAT_CSV_ORDERED_COLUMNS[index]).alias(FEAT_CSV_ORDERED_COLUMNS[index]) for
index, element in enumerate([col("featvec")[i] for i in range(len(feature_vector_labels_integral))])]).replace(
float('nan'), 0)
test_validation_feature_vector_gcs_folder_name = 'validation_feature_vectors_integral'
# ## Exporting integral feature vectors
test_validation_feature_vectors_exported_df = spark.read.parquet(
OUTPUT_BUCKET_FOLDER + test_validation_feature_vector_gcs_folder_name)
test_validation_feature_vectors_exported_df = test_validation_feature_vectors_exported_df.repartition(40,
'display_id').orderBy(
'display_id')
test_validation_feature_vectors_exported_df.take(3)
test_validation_feature_vectors_integral_csv_rdd_df = test_validation_feature_vectors_exported_df.select(
'label', 'display_id', 'ad_id', 'document_id', 'document_id_event', 'feature_vector').withColumn("featvec",
to_array(
"feature_vector")).select(
['label'] + ['display_id'] + ['ad_id'] + ['document_id'] + ['document_id_event'] + [
format_number(element, FEAT_CSV_ORDERED_COLUMNS[index]).alias(FEAT_CSV_ORDERED_COLUMNS[index]) for
index, element in enumerate([col("featvec")[i] for i in range(len(feature_vector_labels_integral))])]).replace(
float('nan'), 0)
def make_spec(output_dir, batch_size=None):
fixed_shape = [batch_size, 1] if batch_size is not None else []
spec = {}
spec[LABEL_COLUMN] = tf.io.FixedLenFeature(shape=fixed_shape, dtype=tf.int64, default_value=None)
spec[DISPLAY_ID_COLUMN] = tf.io.FixedLenFeature(shape=fixed_shape, dtype=tf.int64, default_value=None)
for name in BOOL_COLUMNS:
spec[name] = tf.io.FixedLenFeature(shape=fixed_shape, dtype=tf.int64, default_value=None)
for name in FLOAT_COLUMNS_LOG_BIN_TRANSFORM + FLOAT_COLUMNS_SIMPLE_BIN_TRANSFORM + FLOAT_COLUMNS_NO_TRANSFORM:
spec[name] = tf.io.FixedLenFeature(shape=fixed_shape, dtype=tf.float32, default_value=None)
for name in FLOAT_COLUMNS_SIMPLE_BIN_TRANSFORM:
spec[name + '_binned'] = tf.io.FixedLenFeature(shape=fixed_shape, dtype=tf.int64, default_value=None)
for name in FLOAT_COLUMNS_LOG_BIN_TRANSFORM:
spec[name + '_log_01scaled'] = tf.io.FixedLenFeature(shape=fixed_shape, dtype=tf.float32, default_value=None)
for name in INT_COLUMNS:
spec[name + '_log_01scaled'] = tf.io.FixedLenFeature(shape=fixed_shape, dtype=tf.float32, default_value=None)
for name in BOOL_COLUMNS + CATEGORICAL_COLUMNS:
spec[name] = tf.io.FixedLenFeature(shape=fixed_shape, dtype=tf.int64, default_value=None)
for multi_category in DOC_CATEGORICAL_MULTIVALUED_COLUMNS:
shape = fixed_shape[:-1] + [len(DOC_CATEGORICAL_MULTIVALUED_COLUMNS[multi_category])]
spec[multi_category] = tf.io.FixedLenFeature(shape=shape, dtype=tf.int64)
metadata = dataset_metadata.DatasetMetadata(dataset_schema.from_feature_spec(spec))
metadata_io.write_metadata(metadata, output_dir)
# write out tfrecords meta
make_spec(LOCAL_DATA_TFRECORDS_DIR + '/transformed_metadata', batch_size=batch_size)
def log2_1p(x):
return np.log1p(x) / np.log(2.0)
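# Example: log2_1p(3) == 2.0, since log2(1 + 3) = 2.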
# calculate min and max stats for the given dataframes all in one go
def compute_min_max_logs(df):
print(str(datetime.datetime.now()) + '\tComputing min and max')
min_logs = {}
max_logs = {}
float_expr = []
for name in FLOAT_COLUMNS_LOG_BIN_TRANSFORM + INT_COLUMNS:
float_expr.append(F.min(name))
float_expr.append(F.max(name))
    floatDf = df.agg(*float_expr).collect()
for name in FLOAT_COLUMNS_LOG_BIN_TRANSFORM:
minAgg = floatDf[0]["min(" + name + ")"]
maxAgg = floatDf[0]["max(" + name + ")"]
min_logs[name + '_log_01scaled'] = log2_1p(minAgg * 1000)
max_logs[name + '_log_01scaled'] = log2_1p(maxAgg * 1000)
for name in INT_COLUMNS:
minAgg = floatDf[0]["min(" + name + ")"]
maxAgg = floatDf[0]["max(" + name + ")"]
min_logs[name + '_log_01scaled'] = log2_1p(minAgg)
max_logs[name + '_log_01scaled'] = log2_1p(maxAgg)
return min_logs, max_logs
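# Union the validation and train frames so the min/max log statistics cover both splits.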
all_df = test_validation_feature_vectors_integral_csv_rdd_df.union(train_feature_vectors_integral_csv_rdd_df)
min_logs, max_logs = compute_min_max_logs(all_df)
train_output_string = '/train'
eval_output_string = '/eval'
path = LOCAL_DATA_TFRECORDS_DIR
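# Pack one batch-sized pandas slice into a single tf.train.Example: every feature carries batch_size values,
# matching the [batch_size, 1] FixedLenFeature shapes written by make_spec; numeric columns get the same
# log / min-max scaling derived from compute_min_max_logs.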
def create_tf_example_spark(df, min_logs, max_logs):
result = {}
result[LABEL_COLUMN] = tf.train.Feature(int64_list=tf.train.Int64List(value=df[LABEL_COLUMN].to_list()))
result[DISPLAY_ID_COLUMN] = tf.train.Feature(int64_list=tf.train.Int64List(value=df[DISPLAY_ID_COLUMN].to_list()))
for name in FLOAT_COLUMNS:
value = df[name].to_list()
result[name] = tf.train.Feature(float_list=tf.train.FloatList(value=value))
for name in FLOAT_COLUMNS_SIMPLE_BIN_TRANSFORM:
value = df[name].multiply(10).astype('int64').to_list()
result[name + '_binned'] = tf.train.Feature(int64_list=tf.train.Int64List(value=value))
for name in FLOAT_COLUMNS_LOG_BIN_TRANSFORM:
value_prelim = df[name].multiply(1000).apply(np.log1p).multiply(1. / np.log(2.0))
value = value_prelim.astype('int64').to_list()
result[name + '_binned'] = tf.train.Feature(int64_list=tf.train.Int64List(value=value))
nn = name + '_log_01scaled'
value = value_prelim.add(-min_logs[nn]).multiply(1. / (max_logs[nn] - min_logs[nn])).to_list()
result[nn] = tf.train.Feature(float_list=tf.train.FloatList(value=value))
for name in INT_COLUMNS:
value_prelim = df[name].apply(np.log1p).multiply(1. / np.log(2.0))
value = value_prelim.astype('int64').to_list()
result[name + '_log_int'] = tf.train.Feature(int64_list=tf.train.Int64List(value=value))
nn = name + '_log_01scaled'
value = value_prelim.add(-min_logs[nn]).multiply(1. / (max_logs[nn] - min_logs[nn])).to_list()
result[nn] = tf.train.Feature(float_list=tf.train.FloatList(value=value))
for name in BOOL_COLUMNS + CATEGORICAL_COLUMNS:
value = df[name].fillna(0).astype('int64').to_list()
result[name] = tf.train.Feature(int64_list=tf.train.Int64List(value=value))
for multi_category in DOC_CATEGORICAL_MULTIVALUED_COLUMNS:
values = []
for category in DOC_CATEGORICAL_MULTIVALUED_COLUMNS[multi_category]:
values = values + [df[category].to_numpy()]
        # stack the per-category series into shape [batch_size, num_values] and flatten row-major;
        # FixedLenFeature reshapes the flat list back to [batch_size, num_values] when the TFRecord is parsed
value = np.stack(values, axis=1).flatten().tolist()
result[multi_category] = tf.train.Feature(int64_list=tf.train.Int64List(value=value))
tf_example = tf.train.Example(features=tf.train.Features(feature=result))
return tf_example
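# Modulo-hash categorical ids into a fixed number of buckets, e.g. hash_bucket(100)(12345) == 45.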
def hash_bucket(num_buckets):
return lambda x: x % num_buckets
def _transform_to_tfrecords(rdds):
csv = pd.DataFrame(list(rdds), columns=CSV_ORDERED_COLUMNS)
num_rows = len(csv.index)
examples = []
for start_ind in range(0, num_rows, batch_size if batch_size is not None else 1): # for each batch
        if start_ind + batch_size > num_rows:  # fewer than a full batch of rows remains
csv_slice = csv.iloc[start_ind:]
            # keep the smaller remainder as the last example
print("last Example has: ", len(csv_slice))
examples.append((create_tf_example_spark(csv_slice, min_logs, max_logs), len(csv_slice)))
return examples
else:
csv_slice = csv.iloc[start_ind:start_ind + (batch_size if batch_size is not None else 1)]
examples.append((create_tf_example_spark(csv_slice, min_logs, max_logs), batch_size))
return examples
max_partition_num = 30
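# Hash-bucket the categorical ids, then cut each partition's rows into batch_size slices; a partition's
# last slice may be shorter and is resliced separately below.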
def _transform_to_slices(rdds):
taskcontext = TaskContext.get()
partitionid = taskcontext.partitionId()
csv = pd.DataFrame(list(rdds), columns=CSV_ORDERED_COLUMNS)
for name, size in HASH_BUCKET_SIZES.items():
if name in csv.columns.values:
csv[name] = csv[name].apply(hash_bucket(size))
num_rows = len(csv.index)
print("working with partition: ", partitionid, max_partition_num, num_rows)
examples = []
for start_ind in range(0, num_rows, batch_size if batch_size is not None else 1): # for each batch
        if start_ind + batch_size > num_rows:  # fewer than a full batch of rows remains
csv_slice = csv.iloc[start_ind:]
print("last Example has: ", len(csv_slice), partitionid)
examples.append((csv_slice, len(csv_slice)))
return examples
else:
csv_slice = csv.iloc[start_ind:start_ind + (batch_size if batch_size is not None else 1)]
examples.append((csv_slice, len(csv_slice)))
return examples
def _transform_to_tfrecords_from_slices(rdds):
examples = []
for slice in rdds:
if len(slice[0]) != batch_size:
print("slice size is not correct, dropping: ", len(slice[0]))
else:
examples.append(
(bytearray((create_tf_example_spark(slice[0], min_logs, max_logs)).SerializeToString()), None))
return examples
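# Leftover short slices from all partitions are concatenated and re-batched; in TEST_SET_MODE a final short
# batch is padded by repeating its rows up to batch_size, otherwise it is dropped.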
def _transform_to_tfrecords_from_reslice(rdds):
examples = []
all_dataframes = pd.DataFrame([])
for slice in rdds:
all_dataframes = all_dataframes.append(slice[0])
num_rows = len(all_dataframes.index)
examples = []
for start_ind in range(0, num_rows, batch_size if batch_size is not None else 1): # for each batch
        if start_ind + batch_size > num_rows:  # fewer than a full batch of rows remains
csv_slice = all_dataframes.iloc[start_ind:]
if TEST_SET_MODE:
remain_len = batch_size - len(csv_slice)
(m, n) = divmod(remain_len, len(csv_slice))
print("remainder: ", len(csv_slice), remain_len, m, n)
                # pad by repeating the original rows: m full copies plus the first n rows, reaching batch_size
                if m:
                    original_slice = csv_slice
                    for i in range(m):
                        csv_slice = csv_slice.append(original_slice)
                csv_slice = csv_slice.append(csv_slice.iloc[:n])
print("after fill remainder: ", len(csv_slice))
examples.append(
(bytearray((create_tf_example_spark(csv_slice, min_logs, max_logs)).SerializeToString()), None))
return examples
# drop the remainder
print("dropping remainder: ", len(csv_slice))
return examples
else:
csv_slice = all_dataframes.iloc[start_ind:start_ind + (batch_size if batch_size is not None else 1)]
examples.append(
(bytearray((create_tf_example_spark(csv_slice, min_logs, max_logs)).SerializeToString()), None))
return examples
TEST_SET_MODE = False
train_features = train_feature_vectors_integral_csv_rdd_df.coalesce(30).rdd.mapPartitions(_transform_to_slices)
cached_train_features = train_features.cache()
train_full = cached_train_features.filter(lambda x: x[1] == batch_size)
# split out slices that don't form a full batch so they can be resliced, dropping as few rows as possible
train_not_full = cached_train_features.filter(lambda x: x[1] < batch_size)
train_examples_full = train_full.mapPartitions(_transform_to_tfrecords_from_slices)
train_left = train_not_full.coalesce(1).mapPartitions(_transform_to_tfrecords_from_reslice)
all_train = train_examples_full.union(train_left)
TEST_SET_MODE = True
valid_features = test_validation_feature_vectors_integral_csv_rdd_df \
    .repartition(num_valid_partitions, 'display_id').rdd.mapPartitions(_transform_to_slices)
cached_valid_features = valid_features.cache()
valid_full = cached_valid_features.filter(lambda x: x[1] == batch_size)
valid_not_full = cached_valid_features.filter(lambda x: x[1] < batch_size)
valid_examples_full = valid_full.mapPartitions(_transform_to_tfrecords_from_slices)
valid_left = valid_not_full.coalesce(1).mapPartitions(_transform_to_tfrecords_from_reslice)
all_valid = valid_examples_full.union(valid_left)
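# Write the serialized tf.train.Example records as TFRecord files; TFRecordFileOutputFormat is provided by
# the TensorFlow Hadoop/Spark connector (tensorflow-hadoop jar), which must be on the Spark classpath.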
all_train.saveAsNewAPIHadoopFile(LOCAL_DATA_TFRECORDS_DIR + train_output_string,
"org.tensorflow.hadoop.io.TFRecordFileOutputFormat",
keyClass="org.apache.hadoop.io.BytesWritable",
valueClass="org.apache.hadoop.io.NullWritable")
all_valid.saveAsNewAPIHadoopFile(LOCAL_DATA_TFRECORDS_DIR + eval_output_string,
"org.tensorflow.hadoop.io.TFRecordFileOutputFormat",
keyClass="org.apache.hadoop.io.BytesWritable",
valueClass="org.apache.hadoop.io.NullWritable")
spark.stop()

View file

@@ -0,0 +1,136 @@
# Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
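# Feature-column groups referenced by the preprocessing and training code.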
LABEL_COLUMN = "label"
DISPLAY_ID_COLUMN = 'display_id'
IS_LEAK_COLUMN = 'is_leak'
DISPLAY_ID_AND_IS_LEAK_ENCODED_COLUMN = 'display_ad_and_is_leak'
CATEGORICAL_COLUMNS = [
'ad_id',
'campaign_id',
'doc_id',
'doc_event_id',
'ad_advertiser',
'doc_ad_source_id',
'doc_ad_publisher_id',
'doc_event_publisher_id',
'doc_event_source_id',
'event_country',
'event_country_state',
'event_geo_location',
'event_platform']
DOC_CATEGORICAL_MULTIVALUED_COLUMNS = {
}
BOOL_COLUMNS = []
INT_COLUMNS = [
'ad_views',
'doc_views',
'doc_event_days_since_published',
'doc_ad_days_since_published']
FLOAT_COLUMNS_LOG_BIN_TRANSFORM = []
FLOAT_COLUMNS_NO_TRANSFORM = [
'pop_ad_id',
'pop_document_id',
'pop_publisher_id',
'pop_advertiser_id',
'pop_campain_id',
'pop_source_id',
'doc_event_doc_ad_sim_categories',
'doc_event_doc_ad_sim_topics',
'doc_event_doc_ad_sim_entities',
]
FLOAT_COLUMNS_SIMPLE_BIN_TRANSFORM = []
FLOAT_COLUMNS = FLOAT_COLUMNS_LOG_BIN_TRANSFORM + FLOAT_COLUMNS_SIMPLE_BIN_TRANSFORM + FLOAT_COLUMNS_NO_TRANSFORM
REQUEST_SINGLE_HOT_COLUMNS = [
"doc_event_id",
"doc_id",
"doc_event_source_id",
"event_geo_location",
"event_country_state",
"doc_event_publisher_id",
"event_country",
"event_hour",
"event_platform",
"traffic_source",
"event_weekend",
"user_has_already_viewed_doc"]
REQUEST_MULTI_HOT_COLUMNS = [
"doc_event_entity_id",
"doc_event_topic_id",
"doc_event_category_id"]
REQUEST_NUMERIC_COLUMNS = [
"pop_document_id_conf",
"pop_publisher_id_conf",
"pop_source_id_conf",
"pop_entity_id_conf",
"pop_topic_id_conf",
"pop_category_id_conf",
"pop_document_id",
"pop_publisher_id",
"pop_source_id",
"pop_entity_id",
"pop_topic_id",
"pop_category_id",
"user_views",
"doc_views",
"doc_event_days_since_published",
"doc_event_hour"]
ITEM_SINGLE_HOT_COLUMNS = [
"ad_id",
'campaign_id',
"doc_ad_source_id",
"ad_advertiser",
"doc_ad_publisher_id"]
ITEM_MULTI_HOT_COLUMNS = [
"doc_ad_topic_id",
"doc_ad_entity_id",
"doc_ad_category_id"]
ITEM_NUMERIC_COLUMNS = [
"pop_ad_id_conf",
"user_doc_ad_sim_categories_conf",
"user_doc_ad_sim_topics_conf",
"pop_advertiser_id_conf",
"pop_ad_id",
"pop_advertiser_id",
"pop_campain_id",
"user_doc_ad_sim_categories",
"user_doc_ad_sim_topics",
"user_doc_ad_sim_entities",
"doc_event_doc_ad_sim_categories",
"doc_event_doc_ad_sim_topics",
"doc_event_doc_ad_sim_entities",
"ad_views",
"doc_ad_days_since_published"]
NV_TRAINING_COLUMNS = (
REQUEST_SINGLE_HOT_COLUMNS +
REQUEST_MULTI_HOT_COLUMNS +
REQUEST_NUMERIC_COLUMNS +
ITEM_SINGLE_HOT_COLUMNS +
ITEM_MULTI_HOT_COLUMNS +
ITEM_NUMERIC_COLUMNS)

View file

@@ -0,0 +1,33 @@
# Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from trainer.model.widedeep import wide_deep_model
from trainer.run import train, evaluate
from trainer.utils.arguments import parse_args
from trainer.utils.setup import create_config
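# Entry point: parse the CLI arguments, build the run config and the Wide & Deep model,
# then train or evaluate depending on args.evaluate.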
def main():
args = parse_args()
config = create_config(args)
model = wide_deep_model(args)
if args.evaluate:
evaluate(args, model, config)
else:
train(args, model, config)
if __name__ == '__main__':
main()

Some files were not shown because too many files have changed in this diff.