Merge pull request #806 from NVIDIA/gh/release

[nnUNet/PyT] Release
nv-kkudrynski 2021-01-09 00:11:54 +01:00 committed by GitHub
commit 2badf6e8e4
23 changed files with 2749 additions and 0 deletions


@ -0,0 +1,16 @@
ARG FROM_IMAGE_NAME=nvcr.io/nvidia/pytorch:20.12-py3
FROM ${FROM_IMAGE_NAME}
ADD . /workspace/nnunet_pyt
WORKDIR /workspace/nnunet_pyt
RUN pip install --upgrade pip
RUN pip install --disable-pip-version-check -r requirements.txt
RUN pip install pytorch-lightning==1.0.0 --no-dependencies
RUN pip install monai==0.4.0 --no-dependencies
RUN pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-cuda110==0.29.0
RUN pip install torch_optimizer==0.0.1a15 --no-dependencies
RUN curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
RUN unzip awscliv2.zip
RUN ./aws/install
RUN rm -rf awscliv2.zip aws


@ -0,0 +1,706 @@
# nnU-Net For PyTorch
This repository provides a script and recipe to train the nnU-Net model to achieve state-of-the-art accuracy. It is tested and maintained by NVIDIA.
## Table Of Contents
- [Model overview](#model-overview)
* [Model architecture](#model-architecture)
* [Default configuration](#default-configuration)
* [Feature support matrix](#feature-support-matrix)
* [Features](#features)
* [Mixed precision training](#mixed-precision-training)
* [Enabling mixed precision](#enabling-mixed-precision)
* [TF32](#tf32)
* [Glossary](#glossary)
- [Setup](#setup)
* [Requirements](#requirements)
- [Quick Start Guide](#quick-start-guide)
- [Advanced](#advanced)
* [Scripts and sample code](#scripts-and-sample-code)
* [Parameters](#parameters)
* [Command-line options](#command-line-options)
* [Getting the data](#getting-the-data)
* [Dataset guidelines](#dataset-guidelines)
* [Multi-dataset](#multi-dataset)
* [Training process](#training-process)
* [Inference process](#inference-process)
- [Performance](#performance)
* [Benchmarking](#benchmarking)
* [Training performance benchmark](#training-performance-benchmark)
* [Inference performance benchmark](#inference-performance-benchmark)
* [Results](#results)
* [Training accuracy results](#training-accuracy-results)
* [Training accuracy: NVIDIA DGX A100 (8x A100 80GB)](#training-accuracy-nvidia-dgx-a100-8x-a100-80gb)
* [Training accuracy: NVIDIA DGX-1 (8x V100 16GB)](#training-accuracy-nvidia-dgx-1-8x-v100-16gb)
* [Training performance results](#training-performance-results)
* [Training performance: NVIDIA DGX A100 (8x A100 80GB)](#training-performance-nvidia-dgx-a100-8x-a100-80gb)
* [Training performance: NVIDIA DGX-1 (8x V100 16GB)](#training-performance-nvidia-dgx-1-8x-v100-16gb)
* [Inference performance results](#inference-performance-results)
* [Inference performance: NVIDIA DGX A100 (1x A100 80GB)](#inference-performance-nvidia-dgx-a100-1x-a100-80gb)
* [Inference performance: NVIDIA DGX-1 (1x V100 16GB)](#inference-performance-nvidia-dgx-1-1x-v100-16gb)
- [Release notes](#release-notes)
* [Changelog](#changelog)
* [Known issues](#known-issues)
## Model overview
The nnU-Net ("no-new-Net") refers to a robust and self-adapting framework for U-Net based medical image segmentation. This repository contains a nnU-Net implementation as described in the paper: [nnU-Net: Self-adapting Framework for U-Net-Based Medical Image Segmentation](https://arxiv.org/abs/1809.10486).
The differences between this nnU-Net and the [original model](https://github.com/MIC-DKFZ/nnUNet) are:
- Dynamic selection of patch size and spacings for the low resolution U-Net is not supported; they need to be set in the `data_preprocessing/configs.py` file.
- Cascaded U-Net is not supported.
- The following data augmentations are not used: rotation, simulation of low resolution, gamma augmentation.
This model is trained with mixed precision using Tensor Cores on Volta, Turing, and the NVIDIA Ampere GPU architectures. Therefore, researchers can get results 2x faster than training without Tensor Cores, while experiencing the benefits of mixed precision training. This model is tested against each NGC monthly container release to ensure consistent accuracy and performance over time.
### Model architecture
The nnU-Net allows training two types of networks: 2D U-Net and 3D U-Net to perform semantic segmentation of 3D images, with high accuracy and performance.
The following figure shows the architecture of the 3D U-Net model and its different components. U-Net is composed of a contracting and an expanding path that aim at building a bottleneck in its center-most part through a combination of convolution, instance norm, and leaky ReLU operations. After this bottleneck, the image is reconstructed through a combination of convolutions and upsampling. Skip connections are added with the goal of helping the backward flow of gradients in order to improve training.
<img src="images/unet3d.png" width="900"/>
*Figure 1: The 3D U-Net architecture*
### Default configuration
All convolution blocks in U-Net, in both the encoder and decoder, use two convolution layers followed by instance normalization and a leaky ReLU nonlinearity. Downsampling is done with strided convolutions, and upsampling with transposed convolutions.
All models were trained with the RAdam optimizer, a learning rate of 0.001 and a weight decay of 0.0001. For the loss function we use the average of [cross-entropy](https://en.wikipedia.org/wiki/Cross_entropy) and the [dice coefficient](https://en.wikipedia.org/wiki/S%C3%B8rensen%E2%80%93Dice_coefficient).
Early stopping is triggered if the validation dice score has not improved during the last 100 epochs (see the sketch below).
Data augmentations used: cropping with oversampling of the foreground class, mirroring, zoom, Gaussian noise, Gaussian blur, and brightness.
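As an illustration, such an early-stopping rule maps onto a standard PyTorch Lightning callback. This is a minimal sketch, not the repository's exact code; the metric key `dice_mean` is a hypothetical stand-in for whatever name the validation loop actually logs.
```
import pytorch_lightning as pl
from pytorch_lightning.callbacks import EarlyStopping

# Stop when the monitored validation dice has not improved for 100 epochs.
# "dice_mean" is an illustrative metric key, not necessarily the repo's.
early_stop = EarlyStopping(monitor="dice_mean", mode="max", patience=100, verbose=True)
trainer = pl.Trainer(gpus=1, min_epochs=100, max_epochs=10000, callbacks=[early_stop])
```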
### Feature support matrix
The following features are supported by this model:
| Feature | nnUNet
|-----------------------|--------------------------
|[DALI](https://docs.nvidia.com/deeplearning/dali/release-notes/index.html) | Yes
|Automatic mixed precision (AMP) | Yes
|Distributed data parallel (DDP) | Yes
#### Features
**DALI**
NVIDIA DALI is a library that accelerates the data preparation pipeline. To accelerate your input pipeline, you only need to define your data loader
with the DALI library. For details, see the example sources in this repository or the [DALI documentation](https://docs.nvidia.com/deeplearning/dali/index.html).
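For orientation, a minimal DALI pipeline in the class-based style used by `data_loading/dali_loader.py` could look like the sketch below. It generates random tensors instead of reading TFRecords, so everything in it is illustrative rather than the repository's loader.
```
import nvidia.dali.fn as fn
import nvidia.dali.types as types
from nvidia.dali.pipeline import Pipeline

class RandomPipeline(Pipeline):
    # Minimal sketch of the define_graph pattern; yields random "images"
    # instead of reading the preprocessed TFRecords.
    def __init__(self, batch_size, num_threads, device_id):
        super().__init__(batch_size, num_threads, device_id, seed=42)

    def define_graph(self):
        img = fn.uniform(range=(0.0, 1.0), shape=(4, 64, 64))
        return fn.cast(img, dtype=types.FLOAT)

pipe = RandomPipeline(batch_size=8, num_threads=4, device_id=0)
pipe.build()
images, = pipe.run()
```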
**Automatic Mixed Precision (AMP)**
This implementation uses the native PyTorch AMP implementation of mixed precision training. It allows us to use FP16 training with FP32 master weights by modifying just a few lines of code.
**DistributedDataParallel (DDP)**
The model uses the PyTorch Lightning implementation of distributed data parallelism at the module level, which can run across multiple machines.
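A minimal sketch, assuming the PyTorch Lightning 1.0 API pinned by the Dockerfile (one process per GPU, with gradients synchronized by DDP under the hood; `model` and `data_module` stand in for the repository's objects):
```
import pytorch_lightning as pl

# Launch 8 DDP processes on a single node; sync_batchnorm mirrors --sync_batchnorm.
trainer = pl.Trainer(gpus=8, num_nodes=1, distributed_backend="ddp", sync_batchnorm=True)
# trainer.fit(model, datamodule=data_module)  # model and data_module defined elsewhere
```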
### Mixed precision training
Mixed precision is the combined use of different numerical precisions in a computational method. [Mixed precision](https://arxiv.org/abs/1710.03740) training offers significant computational speedup by performing operations in half-precision format, while storing minimal information in single-precision to retain as much information as possible in critical parts of the network. Since the introduction of [Tensor Cores](https://developer.nvidia.com/tensor-cores) in Volta, and following with both the Turing and Ampere architectures, significant training speedups are experienced by switching to mixed precision -- up to 3x overall speedup on the most arithmetically intense model architectures. Using mixed precision training requires two steps:
1. Porting the model to use the FP16 data type where appropriate.
2. Adding loss scaling to preserve small gradient values.
The ability to train deep learning networks with lower precision was introduced in the Pascal architecture and first supported in [CUDA 8](https://devblogs.nvidia.com/parallelforall/tag/fp16/) in the NVIDIA Deep Learning SDK.
For information about:
* How to train using mixed precision, see the [Mixed Precision Training](https://arxiv.org/abs/1710.03740) paper and [Training With Mixed Precision](https://docs.nvidia.com/deeplearning/performance/mixed-precision-training/index.html) documentation.
* Techniques used for mixed precision training, see the [Mixed-Precision Training of Deep Neural Networks](https://devblogs.nvidia.com/mixed-precision-training-deep-neural-networks/) blog.
* APEX tools for mixed precision training, see the [NVIDIA Apex: Tools for Easy Mixed-Precision Training in PyTorch](https://devblogs.nvidia.com/apex-pytorch-easy-mixed-precision-training/).
#### Enabling mixed precision
For training and inference, mixed precision can be enabled by adding the `--amp` flag. Mixed precision uses the [native PyTorch implementation](https://pytorch.org/blog/accelerating-training-on-nvidia-gpus-with-pytorch-automatic-mixed-precision/).
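Under the hood, `--amp` amounts to the usual autocast/GradScaler pattern. The following is a minimal, self-contained sketch on toy tensors, not the repository's training loop:
```
import torch
from torch import nn

model = nn.Conv3d(4, 3, kernel_size=3, padding=1).cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()
scaler = torch.cuda.amp.GradScaler()

images = torch.randn(2, 4, 32, 32, 32, device="cuda")
labels = torch.randint(0, 3, (2, 32, 32, 32), device="cuda")

optimizer.zero_grad()
with torch.cuda.amp.autocast():            # run ops in FP16 where it is safe
    loss = criterion(model(images), labels)
scaler.scale(loss).backward()              # loss scaling preserves small gradients
scaler.step(optimizer)                     # unscales gradients before the step
scaler.update()
```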
#### TF32
TensorFloat-32 (TF32) is the new math mode in [NVIDIA A100](https://www.nvidia.com/en-us/data-center/a100/) GPUs for handling the matrix math, also called tensor operations. TF32 running on Tensor Cores in A100 GPUs can provide up to 10x speedups compared to single-precision floating-point math (FP32) on Volta GPUs.
TF32 Tensor Cores can speed up networks using FP32, typically with no loss of accuracy. It is more robust than FP16 for models which require high dynamic range for weights or activations.
For more information, refer to the [TensorFloat-32 in the A100 GPU Accelerates AI Training, HPC up to 20x](https://blogs.nvidia.com/blog/2020/05/14/tensorfloat-32-precision-format/) blog post.
TF32 is supported in the NVIDIA Ampere GPU architecture and is enabled by default.
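In PyTorch, TF32 can also be toggled explicitly; the flags below simply make the Ampere defaults visible (shown for completeness, not required by this repository):
```
import torch

# TF32 is enabled by default on Ampere GPUs; these flags make it explicit.
torch.backends.cuda.matmul.allow_tf32 = True  # matrix multiplies use TF32 Tensor Cores
torch.backends.cudnn.allow_tf32 = True        # cuDNN convolutions use TF32
```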
### Glossary
**Test time augmentation**
Test time augmentation is an inference technique that averages predictions over augmented versions of the image with the prediction for the original image. As a result, predictions are more accurate, at the cost of a slower inference process. For nnU-Net, we use all possible flip combinations for image augmentation. Test time augmentation can be enabled by adding the `--tta` flag.
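A minimal sketch of flip-based test time augmentation for a 5D NCDHW volume (illustrative only; `model` and the axis choice are assumptions, not the repository's exact code):
```
import itertools
import torch

def tta_flips(model, x, spatial_dims=(2, 3, 4)):
    # Average the prediction on the original volume with predictions on
    # every combination of flips along the spatial axes, flipped back.
    preds = model(x)
    n = 1
    for k in range(1, len(spatial_dims) + 1):
        for dims in itertools.combinations(spatial_dims, k):
            preds = preds + torch.flip(model(torch.flip(x, dims)), dims)
            n += 1
    return preds / n
```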
**Deep supervision**
Deep supervision is a technique that adds auxiliary losses in the U-Net decoder. For nnU-Net, we add auxiliary losses to all but the lowest two decoder levels. The final loss is the weighted average of the losses. Deep supervision can be enabled by adding the `--deep_supervision` flag.
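The weighted-average idea can be sketched as follows; the halving weights and the assumption that auxiliary outputs are already resized to the target shape are illustrative, not the repository's exact scheme:
```
import torch

def deep_supervision_loss(loss_fn, outputs, target):
    # outputs[0] is the main prediction; outputs[1:] are auxiliary decoder
    # outputs, assumed already resized to the target's spatial shape.
    weights = torch.tensor([1 / 2 ** i for i in range(len(outputs))])
    weights = weights / weights.sum()  # normalize so the result is a weighted average
    return sum(w * loss_fn(out, target) for w, out in zip(weights, outputs))
```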
## Setup
The following section lists the requirements that you need to meet in order to start training the nnU-Net model.
### Requirements
This repository contains a Dockerfile that extends the PyTorch NGC container and encapsulates some dependencies. Aside from these dependencies, ensure you have the following components:
- [NVIDIA Docker](https://github.com/NVIDIA/nvidia-docker)
- PyTorch 20.12 NGC container
- Supported GPUs:
- [NVIDIA Volta architecture](https://www.nvidia.com/en-us/data-center/volta-gpu-architecture/)
- [NVIDIA Turing architecture](https://www.nvidia.com/en-us/geforce/turing/)
- [NVIDIA Ampere architecture](https://www.nvidia.com/en-us/data-center/nvidia-ampere-gpu-architecture/)
For more information about how to get started with NGC containers, see the following sections from the NVIDIA GPU Cloud Documentation and the Deep Learning Documentation:
- [Getting Started Using NVIDIA GPU Cloud](https://docs.nvidia.com/ngc/ngc-getting-started-guide/index.html)
- [Accessing And Pulling From The NGC Container Registry](https://docs.nvidia.com/deeplearning/frameworks/user-guide/index.html#accessing_registry)
- Running [PyTorch](https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/running.html#running)
For those unable to use the PyTorch NGC container, to set up the required environment or create your own container, see the versioned [NVIDIA Container Support Matrix](https://docs.nvidia.com/deeplearning/frameworks/support-matrix/index.html).
## Quick Start Guide
To train your model using mixed or TF32 precision with Tensor Cores or using FP32, perform the following steps using the default parameters of the nnUNet model on the [Medical Segmentation Decathlon](http://medicaldecathlon.com/) dataset. For the specifics concerning training and inference, see the [Advanced](#advanced) section.
1. Clone the repository.
Executing this command will create your local repository with all the code to run nnU-Net.
```
git clone https://github.com/NVIDIA/DeepLearningExamples
cd DeepLearningExamples/Pytorch/Segmentation/nnunet_pyt
```
2. Build the nnU-Net PyTorch NGC container.
This command will use the Dockerfile to create a Docker image named `nnunet_pyt`, downloading all the required components automatically.
```
docker build -t nnunet_pyt .
```
The NGC container contains all the components optimized for usage on NVIDIA hardware.
3. Start an interactive session in the NGC container to run preprocessing/training/inference.
The following command will launch the container and mount the `./data` directory as a volume to the `/data` directory inside the container, and `./results` directory to the `/results` directory in the container.
```
mkdir data results
docker run -it --runtime=nvidia --shm-size=8g --ulimit memlock=-1 --ulimit stack=67108864 --rm -v ${PWD}/data:/data -v ${PWD}/results:/results nnunet_pyt:latest /bin/bash
```
4. Prepare the BraTS dataset.
To download the dataset, run:
```
python download.py --task 01
```
then, to preprocess the 2D or 3D dataset version, run:
```
python preprocess.py --task 01 --dim {2,3}
```
If you have prepared both the 2D and 3D datasets, then `ls /data` should print:
```
01_3d 01_2d Task01_BrainTumour
```
For the specifics concerning data preprocessing, see the [Getting the data](#getting-the-data) section.
5. Start training.
Training can be started with:
```
python scripts/train.py --gpus <gpus> --fold <fold> --dim <dim> [--amp]
```
Where:
```
--gpus number of GPUs
--fold fold number, possible choices: `0, 1, 2, 3, 4`
--dim U-Net dimension, possible choices: `2, 3`
--amp enable automatic mixed precision
```
You can customize the training process. For details, see the [Training process](#training-process) section.
6. Start benchmarking.
The training and inference performance can be evaluated by using benchmarking scripts, such as:
```
python scripts/benchmark.py --mode {train, predict} --gpus <ngpus> --dim {2,3} --batch_size <bsize> [--amp]
```
which will make the model run and report the performance.
7. Start inference/predictions.
Inference can be started with:
```
python scripts/inference.py --dim <dim> --fold <fold> --ckpt_path <path/to/checkpoint> [--amp] [--tta] [--save_preds]
```
Where:
```
--dim U-Net dimension. Possible choices: `2, 3`
--fold fold number. Possible choices: `0, 1, 2, 3, 4`
--val_batch_size batch size (default: 4)
--ckpt_path path to checkpoint
--amp enable automatic mixed precision
--tta enable test time augmentation
--save_preds enable saving prediction masks
```
You can customize the inference process. For details, see the [Inference process](#inference-process) section.
Now that you have your model trained and evaluated, you can choose to compare your training results with our [Training accuracy results](#training-accuracy-results). You can also choose to benchmark your performance against the [Training performance benchmark](#training-performance-results) or the [Inference performance benchmark](#inference-performance-results). Following the steps in these sections will ensure that you achieve the same accuracy and performance results as stated in the [Results](#results) section.
## Advanced
The following sections provide greater details of the dataset, running training and inference, and the training results.
### Scripts and sample code
In the root directory, the most important files are:
* `main.py`: Entry point to the application. Runs training, evaluation, inference or benchmarking.
* `preprocess.py`: Entry point to data preprocessing.
* `download.py`: Downloads a given dataset from [Medical Segmentation Decathlon](http://medicaldecathlon.com/).
* `Dockerfile`: Container with the basic set of dependencies to run nnU-Net.
* `requirements.txt:` Set of extra requirements for running nnU-Net.
The `data_preprocessing/` folder contains information about the data preprocessing used by nnU-Net. Its contents are:
* `configs.py`: Defines dataset configuration like patch size or spacing.
* `preprocessor.py`: Implements data preprocessing pipeline.
* `convert2tfrec.py`: Implements conversion from numpy files to tfrecords.
The `data_loading/` folder contains information about the data pipeline used by nnU-Net. Its contents are:
* `data_module.py`: Defines `LightningDataModule` used by PyTorch Lightning.
* `dali_loader.py`: Implements DALI data loader.
The `model/` folder contains information about the building blocks of nnU-Net and the way they are assembled. Its contents are:
* `layers.py`: Implements convolution blocks used by U-Net template.
* `metrics.py`: Implements metrics and loss function.
* `nn_unet.py`: Implements training/validation/test logic and dynamic creation of U-Net architecture used by nnU-Net.
* `unet.py`: Implements the U-Net template.
The `utils/` folder includes:
* `utils.py`: Defines utility functions, e.g., parser initialization.
* `logger.py`: Defines logging callback for performance benchmarking.
Other folders included in the root directory are:
* `images/`: Contains a model diagram.
* `scripts/`: Provides scripts for data preprocessing, training, benchmarking and inference of nnU-Net.
### Parameters
The complete list of the available parameters for the `main.py` script contains:
* `--exec_mode`: Select the execution mode to run the model (default: `train`). Modes available:
- `train` - Trains model with validation evaluation after every epoch.
- `evaluate` - Loads checkpoint and performs evaluation on validation set (requires `--fold`).
- `predict` - Loads checkpoint and runs inference on the validation set. If the `--save_preds` flag is also provided, the predictions are stored in the `--results` directory.
* `--data`: Path to data directory (default: `/data`)
* `--results`: Path to results directory (default: `/results`)
* `--logname`: Name of dllogger output (default: `None`)
* `--task`: Task number. MSD uses numbers 01-10
* `--gpus`: Number of GPUs (default: `1`)
* `--dim`: U-Net dimension (default: `3`)
* `--amp`: Enable automatic mixed precision (default: `False`)
* `--negative_slope`: Negative slope for LeakyReLU (default: `0.01`)
* `--fold`: Fold number (default: `0`)
* `--nfolds`: Number of cross-validation folds (default: `5`)
* `--patience`: Early stopping patience (default: `50`)
* `--min_epochs`: Force training for at least this many epochs (default: `100`)
* `--max_epochs`: Stop training after this number of epochs (default: `10000`)
* `--batch_size`: Batch size (default: `2`)
* `--val_batch_size`: Validation batch size (default: `4`)
* `--tta`: Enable test time augmentation (default: `False`)
* `--deep_supervision`: Enable deep supervision (default: `False`)
* `--benchmark`: Run model benchmarking (default: `False`)
* `--norm`: Normalization layer, one from: {`instance,batch,group`} (default: `instance`)
* `--oversampling`: Probability of cropped area to have foreground pixels (default: `0.33`)
* `--optimizer`: Optimizer, one from: {`sgd,adam,adamw,radam,fused_adam`} (default: `radam`)
* `--learning_rate`: Learning rate (default: `0.001`)
* `--momentum`: Momentum factor (default: `0.99`)
* `--scheduler`: Learning rate scheduler, one from: {`none,multistep,cosine,plateau`} (default: `none`)
* `--steps`: Steps for multi-step scheduler (default: `None`)
* `--factor`: Factor used by `multistep` and `reduceLROnPlateau` schedulers (default: `0.1`)
* `--lr_patience`: Patience for ReduceLROnPlateau scheduler (default: `75`)
* `--weight_decay`: Weight decay (L2 penalty) (default: `0.0001`)
* `--seed`: Random seed (default: `1`)
* `--num_workers`: Number of subprocesses to use for data loading (default: `8`)
* `--resume_training`: Resume training from the last checkpoint (default: `False`)
* `--overlap`: Amount of overlap between scans during sliding window inference (default: `0.25`)
* `--val_mode`: How to blend the output of overlapping windows, one from: {`gaussian,constant`} (default: `gaussian`); see the sliding-window sketch after this list
* `--ckpt_path`: Path to checkpoint
* `--save_preds`: Enable prediction saving (default: `False`)
* `--warmup`: Warmup iterations before collecting statistics for model benchmarking. (default: `5`)
* `--train_batches`: Limit number of batches for training (default: 0)
* `--test_batches`: Limit number of batches for evaluation/inference (default: 0)
* `--affinity`: Type of CPU affinity (default: `socket_unique_interleaved`)
* `--save_ckpt`: Enable saving checkpoint (default: `False`)
* `--gradient_clip_val`: Gradient clipping value (default: `0`)
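For intuition, the sliding-window flags above map directly onto MONAI's `sliding_window_inference` (MONAI is pinned in the Dockerfile). The sketch below uses a toy network and volume; only the parameter mapping is the point:
```
import torch
from monai.inferers import sliding_window_inference

net = torch.nn.Conv3d(4, 3, kernel_size=3, padding=1)  # toy stand-in for the U-Net
volume = torch.randn(1, 4, 160, 192, 160)              # full image, larger than a patch

logits = sliding_window_inference(
    inputs=volume,
    roi_size=(128, 128, 128),  # training patch size
    sw_batch_size=4,           # cf. --val_batch_size
    predictor=net,
    overlap=0.25,              # cf. --overlap
    mode="gaussian",           # cf. --val_mode
)
```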
### Command-line options
To see the full list of available options and their descriptions, use the `-h` or `--help` command-line option, for example:
`python main.py --help`
The following example output is printed when running the model:
```
usage: main.py [-h] [--exec_mode {train,evaluate,predict}] [--data DATA] [--results RESULTS] [--logname LOGNAME] [--task TASK] [--gpus GPUS] [--num_nodes NUM_NODES] [--learning_rate LEARNING_RATE] [--gradient_clip_val GRADIENT_CLIP_VAL] [--accumulate_grad_batches ACCUMULATE_GRAD_BATCHES] [--negative_slope NEGATIVE_SLOPE] [--tta] [--amp] [--benchmark] [--deep_supervision] [--sync_batchnorm] [--save_ckpt] [--nfolds NFOLDS] [--seed SEED] [--ckpt_path CKPT_PATH] [--fold FOLD] [--patience PATIENCE] [--lr_patience LR_PATIENCE] [--batch_size BATCH_SIZE] [--val_batch_size VAL_BATCH_SIZE] [--steps STEPS [STEPS ...]] [--create_idx] [--profile] [--momentum MOMENTUM] [--weight_decay WEIGHT_DECAY] [--save_preds] [--dim {2,3}] [--resume_training] [--factor FACTOR] [--num_workers NUM_WORKERS] [--min_epochs MIN_EPOCHS] [--max_epochs MAX_EPOCHS] [--warmup WARMUP] [--oversampling OVERSAMPLING] [--norm {instance,batch,group}] [--overlap OVERLAP] [--affinity {socket,single,single_unique,socket_unique_interleaved,socket_unique_continuous,disabled}] [--scheduler {none,multistep,cosine,plateau}] [--optimizer {sgd,adam,adamw,radam,fused_adam}] [--val_mode {gaussian,constant}] [--train_batches TRAIN_BATCHES] [--test_batches TEST_BATCHES]
optional arguments:
-h, --help show this help message and exit
--exec_mode {train,evaluate,predict}
Execution mode to run the model (default: train)
--data DATA Path to data directory (default: /data)
--results RESULTS Path to results directory (default: /results)
--logname LOGNAME Name of dlloger output (default: None)
--task TASK Task number. MSD uses numbers 01-10 (default: None)
--gpus GPUS Number of gpus (default: 1)
--learning_rate LEARNING_RATE
Learning rate (default: 0.001)
--gradient_clip_val GRADIENT_CLIP_VAL
Gradient clipping norm value (default: 0)
--negative_slope NEGATIVE_SLOPE
Negative slope for LeakyReLU (default: 0.01)
--tta Enable test time augmentation (default: False)
--amp Enable automatic mixed precision (default: False)
--benchmark Run model benchmarking (default: False)
--deep_supervision Enable deep supervision (default: False)
--sync_batchnorm Enable synchronized batchnorm (default: False)
--save_ckpt Enable saving checkpoint (default: False)
--nfolds NFOLDS Number of cross-validation folds (default: 5)
--seed SEED Random seed (default: 1)
--ckpt_path CKPT_PATH
Path to checkpoint (default: None)
--fold FOLD Fold number (default: 0)
--patience PATIENCE Early stopping patience (default: 100)
--lr_patience LR_PATIENCE
Patience for ReduceLROnPlateau scheduler (default: 70)
--batch_size BATCH_SIZE
Batch size (default: 2)
--val_batch_size VAL_BATCH_SIZE
Validation batch size (default: 4)
--steps STEPS [STEPS ...]
Steps for multistep scheduler (default: None)
--create_idx Create index files for tfrecord (default: False)
--profile Run dlprof profiling (default: False)
--momentum MOMENTUM Momentum factor (default: 0.99)
--weight_decay WEIGHT_DECAY
Weight decay (L2 penalty) (default: 0.0001)
--save_preds Enable prediction saving (default: False)
--dim {2,3} UNet dimension (default: 3)
--resume_training Resume training from the last checkpoint (default: False)
--factor FACTOR Scheduler factor (default: 0.3)
--num_workers NUM_WORKERS
Number of subprocesses to use for data loading (default: 8)
--min_epochs MIN_EPOCHS
Force training for at least these many epochs (default: 100)
--max_epochs MAX_EPOCHS
Stop training after this number of epochs (default: 10000)
--warmup WARMUP Warmup iterations before collecting statistics (default: 5)
--oversampling OVERSAMPLING
Probability of crop to have some region with positive label (default: 0.33)
--norm {instance,batch,group}
Normalization layer (default: instance)
--overlap OVERLAP Amount of overlap between scans during sliding window inference (default: 0.25)
--affinity {socket,single,single_unique,socket_unique_interleaved,socket_unique_continuous,disabled}
type of GPU affinity (default: socket_unique_interleaved)
--scheduler {none,multistep,cosine,plateau}
Learning rate scheduler (default: none)
--optimizer {sgd,adam,adamw,radam,fused_adam}
Optimizer (default: radam)
--val_mode {gaussian,constant}
How to blend output of overlapping windows (default: gaussian)
--train_batches TRAIN_BATCHES
Limit number of batches for training (used for benchmarking mode only) (default: 0)
--test_batches TEST_BATCHES
Limit number of batches for inference (used for benchmarking mode only) (default: 0)
```
### Getting the data
The nnU-Net model was trained on the [Medical Segmentation Decathlon](http://medicaldecathlon.com/) datasets. All datasets are in Neuroimaging Informatics Technology Initiative (NIfTI) format.
#### Dataset guidelines
To train nnU-Net you will need to preprocess your dataset as a first step with the `preprocess.py` script.
The `preprocess.py` script uses the following command-line options:
```
--data Path to data directory (default: `/data`)
--results Path to directory for saving preprocessed data (default: `/data`)
--exec_mode Mode for data preprocessing
--task Task number to preprocess. MSD uses numbers 01-10
--dim Data dimension to prepare (default: `3`)
--n_jobs Number of parallel jobs for data preprocessing (default: `-1`)
--vpf Number of volumes per tfrecord (default: `1`)
```
To preprocess data for 3D U-Net run: `python preprocess.py --task 01 --dim 3`
In `data_preprocessing/configs.py`, the patch size, precomputed spacings, and statistics for CT datasets are defined for each [Medical Segmentation Decathlon](http://medicaldecathlon.com/) task.
The preprocessing pipeline consists of the following steps:
1. Cropping to the region of nonzero values.
2. Resampling to the median voxel spacing of the respective dataset (except for anisotropic datasets, where the lowest-resolution axis is resampled to the 10th percentile of the spacings).
3. Padding volumes so that each dimension is at least as large as the patch size.
4. Normalizing (see the sketch after this list):
* For CT modalities, the voxel values are clipped to the 0.5 and 99.5 percentiles of the foreground voxels, and the data is then normalized with the mean and standard deviation collected from the foreground voxels.
* For MRI modalities, z-score normalization is applied.
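A minimal NumPy sketch of this normalization step (the per-task CT clip values and statistics live in `data_preprocessing/configs.py`; the function and argument names here are illustrative):
```
import numpy as np

def normalize(volume, modality, ct_clip=None, ct_stats=None):
    if modality == "CT":
        lo, hi = ct_clip        # 0.5 and 99.5 foreground percentiles
        mean, std = ct_stats    # collected from foreground voxels
        return (np.clip(volume, lo, hi) - mean) / std
    # MRI: z-score normalization per volume
    return (volume - volume.mean()) / volume.std()
```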
#### Multi-dataset
Adding your own dataset is possible; however, your data should correspond to the [Medical Segmentation Decathlon](http://medicaldecathlon.com/) format (i.e., data should be in `NIfTI` format and there should be a `dataset.json` file where you provide the fields: modality, labels, and at least one of training, test).
To add your dataset, perform the following:
1. Mount your dataset to `/data` directory.
2. In `data_preprocessing/config.py`:
- Add to the `task_dir` dictionary your dataset directory name. For example, for Brain Tumour dataset, it corresponds to `"01": "Task01_BrainTumour"`.
- Add the patch size that you want to use for training to the `patch_size` dictionary. For example, for the Brain Tumour dataset it corresponds to `"01_3d": [128, 128, 128]` for 3D U-Net and `"01_2d": [192, 160]` for 2D U-Net. The suffixes `_3d` and `_2d` correspond to 3D U-Net and 2D U-Net, respectively.
3. Preprocess your data with the `preprocess.py` script. For example, to preprocess the Brain Tumour dataset for 2D U-Net you should run `python preprocess.py --task 01 --dim 2`.
### Training process
The model trains for at least `--min_epochs` and at most `--max_epochs` epochs. After each epoch, evaluation on the validation set is performed and the validation loss is monitored for early stopping (see the `--patience` flag). Default training settings are:
* RAdam optimizer with learning rate of 0.001 and weight decay 0.0001.
* Training batch size is set to 2 for 3D U-Net and 16 for 2D U-Net.
This default parametrization is applied when running scripts from the `./scripts` directory and when running `main.py` without explicitly overriding these parameters. By default, the training is in full precision. To enable AMP, pass the `--amp` flag. AMP can be enabled for every mode of execution.
The default configuration minimizes the function `L = 0.5 * (1 - dice) + 0.5 * cross entropy` during training and reports the achieved convergence as the [dice coefficient](https://en.wikipedia.org/wiki/S%C3%B8rensen%E2%80%93Dice_coefficient) per class. Training with a combination of dice and cross-entropy has been shown to achieve better convergence than training with dice alone.
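A minimal sketch of such a combined loss for 3D volumes (a soft dice over softmax probabilities, macro-averaged over classes; illustrative only, not the repository's `model/metrics.py`):
```
import torch
import torch.nn.functional as F

def dice_ce_loss(logits, target, eps=1e-5):
    # logits: (N, C, D, H, W); target: (N, D, H, W) with integer class labels
    num_classes = logits.shape[1]
    probs = torch.softmax(logits, dim=1)
    one_hot = F.one_hot(target, num_classes).permute(0, 4, 1, 2, 3).float()
    dims = tuple(range(2, logits.dim()))          # reduce over spatial axes
    intersection = (probs * one_hot).sum(dims)
    union = probs.sum(dims) + one_hot.sum(dims)
    dice = ((2 * intersection + eps) / (union + eps)).mean()
    return 0.5 * (1 - dice) + 0.5 * F.cross_entropy(logits, target)
```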
The training can be run directly without using the predefined scripts. The name of the training script is `main.py`. For example:
```
python main.py --exec_mode train --task 01 --fold 0 --gpus 1 --amp --deep_supervision
```
Training artifacts will be saved to `/results` (you can override it with `--results <path/to/results/>`) in the container. Some important artifacts are:
* `/results/logs.json`: Dice scores and loss values collected on the validation set after each epoch during training.
* `/results/train_logs.json`: The best dice scores achieved during training.
* `/results/checkpoints`: Saved checkpoints. By default, two checkpoints are saved - one after each epoch (`last.ckpt`) and one with the highest validation dice (e.g. `epoch=5.ckpt` if the highest dice was achieved in the 5th epoch).
To load a pretrained model, provide `--ckpt_path <path/to/checkpoint>`.
### Inference process
Inference can be launched by passing the `--exec_mode predict` flag. For example:
```
python main.py --exec_mode predict --task 01 --fold 0 --gpus 1 --amp --tta --save_preds --ckpt_path <path/to/checkpoint>
```
The script will then:
* Load the checkpoint specified by `<path/to/checkpoint>`
* Run inference on the preprocessed validation dataset corresponding to fold 0
* Print achieved score to the console
* If `--save_preds` is provided then resulting masks in the NumPy format will be saved in the `/results` directory
## Performance
### Benchmarking
The following section shows how to run benchmarks to measure the model performance in training and inference modes.
#### Training performance benchmark
To benchmark training, run one of the scripts in `./scripts`:
```
python scripts/benchmark.py --mode train --gpus <ngpus> --dim {2,3} --batch_size <bsize> [--amp]
```
For example, to benchmark 3D U-Net training using mixed-precision on 8 GPUs with batch size of 2 for 80 batches, run:
```
python scripts/benchmark.py --mode train --gpus 8 --dim 3 --batch_size 2 --train_batches 80 --amp
```
Each of these scripts will by default run 10 warm-up iterations and benchmark the performance during the next 70 iterations. To modify these values provide: `--warmup <warmup> --train_batches <number/of/train/batches>`.
At the end of the script, a line reporting the best train throughput and latency will be printed.
#### Inference performance benchmark
To benchmark inference, run one of the scripts in `./scripts`:
```
python scripts/benchmark.py --mode predict --dim {2,3} --batch_size <bsize> --test_batches <number/of/test/batches> [--amp]
```
For example, to benchmark inference using mixed-precision for 3D U-Net, with batch size of 4 for 80 batches, run:
```
python scripts/benchmark.py --mode predict --dim 3 --amp --batch_size 4 --test_batches 80
```
Each of these scripts will by default run 10 warm-up iterations and benchmark the performance during the next 70 iterations. To modify these values provide: `--warmup <warmup> --test_batches <number/of/test/batches>`.
At the end of the script, a line reporting the inference throughput and latency will be printed.
### Results
The following sections provide details on how to achieve the same performance and accuracy in training and inference.
#### Training accuracy results
##### Training accuracy: NVIDIA DGX A100 (8x A100 80GB)
Our results were obtained by running the `python scripts/train.py --gpus {1,8} --fold {0,1,2,3,4} --dim {2,3} --batch_size <bsize> [--amp]` training scripts and averaging results in the PyTorch 20.12 NGC container on NVIDIA DGX A100 with (8x A100 80GB) GPUs.
| Dimension | GPUs | Batch size / GPU | Accuracy - mixed precision | Accuracy - TF32 | Time to train - mixed precision | Time to train - TF32| Time to train speedup (TF32 to mixed precision)
|:-:|:-:|:--:|:-----:|:-----:|:-----:|:-----:|:----:|
| 2 | 1 | 16 |0.7021 |0.7051 |89 min | 104 min| 1.17 |
| 2 | 8 | 16 |0.7316 |0.7316 |13 min | 17 min| 1.31 |
| 3 | 1 | 2 |0.7436 |0.7433 |241 min|342 min| 1.42 |
| 3 | 8 | 2 |0.7443 |0.7443 |36 min | 44 min| 1.22 |
##### Training accuracy: NVIDIA DGX-1 (8x V100 16GB)
Our results were obtained by running the `python scripts/train.py --gpus {1,8} --fold {0,1,2,3,4} --dim {2,3} --batch_size <bsize> [--amp]` training scripts and averaging results in the PyTorch 20.12 NGC container on NVIDIA DGX-1 with (8x V100 16GB) GPUs.
| Dimension | GPUs | Batch size / GPU | Accuracy - mixed precision | Accuracy - FP32 | Time to train - mixed precision | Time to train - FP32 | Time to train speedup (FP32 to mixed precision)
|:-:|:-:|:--:|:-----:|:-----:|:-----:|:-----:|:----:|
| 2 | 1 | 16 |0.7034 |0.7033 |144 min|180 min| 1.25 |
| 2 | 8 | 16 |0.7319 |0.7315 |37 min |44 min | 1.19 |
| 3 | 1 | 2 |0.7439 |0.7436 |317 min|738 min| 2.32 |
| 3 | 8 | 2 |0.7440 |0.7441 |58 min |121 min| 2.09 |
#### Training performance results
##### Training performance: NVIDIA DGX A100 (8x A100 80GB)
Our results were obtained by running the `python scripts/benchmark.py --mode train --gpus {1,8} --dim {2,3} --batch_size <bsize> [--amp]` training script in the NGC container on NVIDIA DGX A100 (8x A100 80GB) GPUs. Performance numbers (in volumes per second) were averaged over an entire training epoch.
| Dimension | GPUs | Batch size / GPU | Throughput - mixed precision [img/s] | Throughput - TF32 [img/s] | Throughput speedup (TF32 - mixed precision) | Weak scaling - mixed precision | Weak scaling - TF32 |
|:-:|:-:|:--:|:------:|:------:|:-----:|:-----:|:-----:|
| 2 | 1 | 32 | 674.34 | 489.3 | 1.38 | N/A | N/A |
| 2 | 1 | 64 | 856.34 | 565.62 | 1.51 | N/A | N/A |
| 2 | 1 | 128| 926.64 | 600.34 | 1.54 | N/A | N/A |
| 2 | 8 | 32 | 3957.33 | 3275.88 | 1.21| 5.868 | 6.695 |
| 2 | 8 | 64 | 5667.14 | 4037.82 | 1.40 | 6.618 | 7.139 |
| 2 | 8 | 128| 6310.97 | 4568.13 | 1.38 | 6.811 | 7.609 |
| 3 | 1 | 1 | 4.24 | 3.57 | 1.19 | N/A | N/A |
| 3 | 1 | 2 | 6.74 | 5.21 | 1.29 | N/A | N/A |
| 3 | 1 | 4 | 9.52 | 4.16 | 2.29 | N/A | N/A |
| 3 | 8 | 1 | 32.48 | 27.79 | 1.17 | 7.66 | 7.78 |
| 3 | 8 | 2 | 51.50 | 40.67 | 1.27 | 7.64 | 7.81 |
| 3 | 8 | 4 | 74.29 | 31.50 | 2.36 | 7.80 | 7.57 |
To achieve these same results, follow the steps in the [Quick Start Guide](#quick-start-guide).
##### Training performance: NVIDIA DGX-1 (8x V100 16GB)
Our results were obtained by running the `python scripts/benchmark.py --mode train --gpus {1,8} --dim {2,3} --batch_size <bsize> [--amp]` training script in the PyTorch 20.10 NGC container on NVIDIA DGX-1 with (8x V100 16GB) GPUs. Performance numbers (in volumes per second) were averaged over an entire training epoch.
| Dimension | GPUs | Batch size / GPU | Throughput - mixed precision [img/s] | Throughput - FP32 [img/s] | Throughput speedup (FP32 - mixed precision) | Weak scaling - mixed precision | Weak scaling - FP32 |
|:-:|:-:|:---:|:---------:|:-----------:|:--------:|:---------:|:-------------:|
| 2 | 1 | 32 | 416.68 | 275.99 | 1.51 | N/A | N/A |
| 2 | 1 | 64 | 524.13 | 281.84 | 1.86 | N/A | N/A |
| 2 | 1 | 128| 557.48 | 272.68 | 2.04 | N/A | N/A |
| 2 | 8 | 32 | 2731.22 | 2005.49 | 1.36 | 6.56 | 7.27 |
| 2 | 8 | 64 | 3604.83 | 2088.58 | 1.73 | 6.88 | 7.41 |
| 2 | 8 | 128| 4202.35 | 2094.63 | 2.01 | 7.54 | 7.68 |
| 3 | 1 | 1 | 3.97 | 1.77 | 2.24 | N/A | N/A |
| 3 | 1 | 2 | 5.49 | 2.32 | 2.37 | N/A | N/A |
| 3 | 1 | 4 | 6.78 | OOM | N/A | N/A | N/A |
| 3 | 8 | 1 | 29.98 | 13.78 | 2.18 | 7.55 | 7.79 |
| 3 | 8 | 2 | 41.31 | 18.11 | 2.28 | 7.53 | 7.81 |
| 3 | 8 | 4 | 50.26 | OOM | N/A | 7.41 | N/A |
To achieve these same results, follow the steps in the [Quick Start Guide](#quick-start-guide).
#### Inference performance results
##### Inference performance: NVIDIA DGX A100 (1x A100 80GB)
Our results were obtained by running the `python scripts/benchmark.py --mode predict --dim {2,3} --batch_size <bsize> [--amp]` inference benchmarking script in the PyTorch 20.10 NGC container on an NVIDIA DGX A100 (1x A100 80GB) GPU.
FP16
| Dimension | Batch size | Resolution | Throughput Avg [img/s] | Latency Avg [ms] | Latency 90% [ms] | Latency 95% [ms] | Latency 99% [ms] |
|:----------:|:---------:|:-------------:|:----------------------:|:----------------:|:----------------:|:----------------:|:----------------:|
| 2 | 32 | 4x192x160 | 3281.91 | 9.75 | 9.88 | 10.14 | 10.17 |
| 2 | 64 | 4x192x160 | 3625.3 | 17.65 | 18.13 | 18.16 | 18.24 |
| 2 |128 | 4x192x160 | 3867.24 | 33.10 | 33.29 | 33.29 | 33.35 |
| 3 | 1 | 4x128x128x128 | 10.93| 91.52 | 91.30 | 92.68 | 111.87|
| 3 | 2 | 4x128x128x128 | 18.85| 106.08| 105.12| 106.05| 127.95|
| 3 | 4 | 4x128x128x128 | 27.4 | 145.98| 164.05| 165.58| 183.43|
TF32
| Dimension | Batch size | Resolution | Throughput Avg [img/s] | Latency Avg [ms] | Latency 90% [ms] | Latency 95% [ms] | Latency 99% [ms] |
|:----------:|:---------:|:-------------:|:----------------------:|:----------------:|:----------------:|:----------------:|:----------------:|
| 2 | 32 | 4x192x160 | 2002.66 | 15.98 | 16.14 | 16.24 | 16.37|
| 2 | 64 | 4x192x160 | 2180.54 | 29.35 | 29.50 | 29.51 | 29.59|
| 2 |128 | 4x192x160 | 2289.12 | 55.92 | 56.08 | 56.13 | 56.36|
| 3 | 1 | 4x128x128x128 | 10.05| 99.55 | 99.17 | 99.82 |120.39|
| 3 | 2 | 4x128x128x128 | 16.29|122.78 |123.06 |124.02 |143.47|
| 3 | 4 | 4x128x128x128 | 15.99|250.16 |273.67 |274.85 |297.06|
Throughput is reported in images per second. Latency is reported in milliseconds per batch.
To achieve these same results, follow the steps in the [Quick Start Guide](#quick-start-guide).
##### Inference performance: NVIDIA DGX-1 (1x V100 16GB)
Our results were obtained by running the `python scripts/benchmark.py --mode predict --dim {2,3} --batch_size <bsize> [--amp]` inference benchmarking script in the PyTorch 20.10 NGC container on NVIDIA DGX-1 with (1x V100 16GB) GPU.
FP16
| Dimension | Batch size | Resolution | Throughput Avg [img/s] | Latency Avg [ms] | Latency 90% [ms] | Latency 95% [ms] | Latency 99% [ms] |
|:----------:|:---------:|:-------------:|:----------------------:|:----------------:|:----------------:|:----------------:|:----------------:|
| 2 | 32 | 4x192x160 | 1697.16 | 18.86 | 18.89 | 18.95 | 18.99 |
| 2 | 64 | 4x192x160 | 2008.81 | 31.86 | 31.95 | 32.01 | 32.08 |
| 2 |128 | 4x192x160 | 2221.44 | 57.62 | 57.83 | 57.88 | 57.96 |
| 3 | 1 | 4x128x128x128 | 11.01 | 90.76 | 89.96 | 90.53 | 116.67 |
| 3 | 2 | 4x128x128x128 | 16.60 | 120.49 | 119.69 | 120.72 | 146.42 |
| 3 | 4 | 4x128x128x128 | 21.18 | 188.85 | 211.92 | 214.17 | 238.19 |
FP32
| Dimension | Batch size | Resolution | Throughput Avg [img/s] | Latency Avg [ms] | Latency 90% [ms] | Latency 95% [ms] | Latency 99% [ms] |
|:----------:|:---------:|:-------------:|:----------------------:|:----------------:|:----------------:|:----------------:|:----------------:|
| 2 | 32 | 4x192x160 | 1106.22 | 28.93 | 29.06 | 29.10 | 29.15 |
| 2 | 64 | 4x192x160 | 1157.24 | 55.30 | 55.39 | 55.44 | 55.50 |
| 2 |128 | 4x192x160 | 1171.24 | 109.29 | 109.83 | 109.98 | 110.58 |
| 3 | 1 | 4x128x128x128 | 6.8 | 147.10 | 147.51 | 148.15 | 170.46 |
| 3 | 2 | 4x128x128x128 | 8.53| 234.46 | 237.00 | 238.43 | 258.92 |
| 3 | 4 | 4x128x128x128 | 9.6 | 416.83 | 439.97 | 442.12 | 454.69 |
Throughput is reported in images per second. Latency is reported in milliseconds per batch.
To achieve these same results, follow the steps in the [Quick Start Guide](#quick-start-guide).
## Release notes
### Changelog
January 2021
- Initial release
### Known issues
There are no known issues in this release.


@ -0,0 +1,260 @@
import itertools
import os
import numpy as np
import nvidia.dali.fn as fn
import nvidia.dali.math as math
import nvidia.dali.ops as ops
import nvidia.dali.tfrecord as tfrec
import nvidia.dali.types as types
from nvidia.dali.pipeline import Pipeline
from nvidia.dali.plugin.pytorch import DALIGenericIterator
class TFRecordTrain(Pipeline):
def __init__(self, batch_size, num_threads, device_id, **kwargs):
super(TFRecordTrain, self).__init__(batch_size, num_threads, device_id)
self.dim = kwargs["dim"]
self.seed = kwargs["seed"]
self.oversampling = kwargs["oversampling"]
self.input = ops.TFRecordReader(
path=kwargs["tfrecords"],
index_path=kwargs["tfrecords_idx"],
features={
"X_shape": tfrec.FixedLenFeature([self.dim + 1], tfrec.int64, 0),
"Y_shape": tfrec.FixedLenFeature([self.dim + 1], tfrec.int64, 0),
"X": tfrec.VarLenFeature([], tfrec.float32, 0.0),
"Y": tfrec.FixedLenFeature([], tfrec.string, ""),
"fname": tfrec.FixedLenFeature([], tfrec.string, ""),
},
num_shards=kwargs["gpus"],
shard_id=device_id,
random_shuffle=True,
pad_last_batch=True,
read_ahead=True,
seed=self.seed,
)
self.patch_size = kwargs["patch_size"]
self.crop_shape = types.Constant(np.array(self.patch_size), dtype=types.INT64)
self.crop_shape_float = types.Constant(np.array(self.patch_size), dtype=types.FLOAT)
self.layout = "CDHW" if self.dim == 3 else "CHW"
self.axis_name = "DHW" if self.dim == 3 else "HW"
def load_data(self, features):
img = fn.reshape(features["X"], shape=features["X_shape"], layout=self.layout)
lbl = fn.reshape(features["Y"], shape=features["Y_shape"], layout=self.layout)
lbl = fn.reinterpret(lbl, dtype=types.DALIDataType.UINT8)
return img, lbl
def random_augmentation(self, probability, augmented, original):
condition = fn.cast(fn.coin_flip(probability=probability), dtype=types.DALIDataType.BOOL)
neg_condition = condition ^ True
return condition * augmented + neg_condition * original
@staticmethod
def slice_fn(img, start_idx, length):
return fn.slice(img, start_idx, length, axes=[0])
def crop_fn(self, img, lbl):
center = fn.segmentation.random_mask_pixel(lbl, foreground=fn.coin_flip(probability=self.oversampling))
crop_anchor = self.slice_fn(center, 1, self.dim) - self.crop_shape // 2
adjusted_anchor = math.max(0, crop_anchor)
max_anchor = self.slice_fn(fn.shapes(lbl), 1, self.dim) - self.crop_shape
crop_anchor = math.min(adjusted_anchor, max_anchor)
img = fn.slice(img.gpu(), crop_anchor, self.crop_shape, axis_names=self.axis_name, out_of_bounds_policy="pad")
lbl = fn.slice(lbl.gpu(), crop_anchor, self.crop_shape, axis_names=self.axis_name, out_of_bounds_policy="pad")
return img, lbl
def zoom_fn(self, img, lbl):
resized_shape = self.crop_shape * self.random_augmentation(0.15, fn.uniform(range=(0.7, 1.0)), 1.0)
img, lbl = fn.crop(img, crop=resized_shape), fn.crop(lbl, crop=resized_shape)
img = fn.resize(img, interp_type=types.DALIInterpType.INTERP_CUBIC, size=self.crop_shape_float)
lbl = fn.resize(lbl, interp_type=types.DALIInterpType.INTERP_NN, size=self.crop_shape_float)
return img, lbl
def noise_fn(self, img):
img_noised = img + fn.normal_distribution(img, stddev=fn.uniform(range=(0.0, 0.33)))
return self.random_augmentation(0.15, img_noised, img)
def blur_fn(self, img):
img_blured = fn.gaussian_blur(img, sigma=fn.uniform(range=(0.5, 1.5)))
return self.random_augmentation(0.15, img_blured, img)
def brightness_fn(self, img):
brightness_scale = self.random_augmentation(0.15, fn.uniform(range=(0.7, 1.3)), 1.0)
return img * brightness_scale
def contrast_fn(self, img):
min_, max_ = fn.reductions.min(img), fn.reductions.max(img)
scale = self.random_augmentation(0.15, fn.uniform(range=(0.65, 1.5)), 1.0)
img = math.clamp(img * scale, min_, max_)
return img
def flips_fn(self, img, lbl):
kwargs = {"horizontal": fn.coin_flip(probability=0.33), "vertical": fn.coin_flip(probability=0.33)}
if self.dim == 3:
kwargs.update({"depthwise": fn.coin_flip(probability=0.33)})
return fn.flip(img, **kwargs), fn.flip(lbl, **kwargs)
def define_graph(self):
features = self.input(name="Reader")
img, lbl = self.load_data(features)
img, lbl = self.crop_fn(img, lbl)
img, lbl = self.zoom_fn(img, lbl)
img = self.noise_fn(img)
img = self.blur_fn(img)
img = self.brightness_fn(img)
img = self.contrast_fn(img)
img, lbl = self.flips_fn(img, lbl)
return img, lbl
class TFRecordEval(Pipeline):
def __init__(self, batch_size, num_threads, device_id, **kwargs):
super(TFRecordEval, self).__init__(batch_size, num_threads, device_id)
self.input = ops.TFRecordReader(
path=kwargs["tfrecords"],
index_path=kwargs["tfrecords_idx"],
features={
"X_shape": tfrec.FixedLenFeature([4], tfrec.int64, 0),
"Y_shape": tfrec.FixedLenFeature([4], tfrec.int64, 0),
"X": tfrec.VarLenFeature([], tfrec.float32, 0.0),
"Y": tfrec.FixedLenFeature([], tfrec.string, ""),
"fname": tfrec.FixedLenFeature([], tfrec.string, ""),
},
shard_id=device_id,
num_shards=kwargs["gpus"],
read_ahead=True,
random_shuffle=False,
pad_last_batch=True,
)
def load_data(self, features):
img = fn.reshape(features["X"].gpu(), shape=features["X_shape"], layout="CDHW")
lbl = fn.reshape(features["Y"].gpu(), shape=features["Y_shape"], layout="CDHW")
lbl = fn.reinterpret(lbl, dtype=types.DALIDataType.UINT8)
return img, lbl
def define_graph(self):
features = self.input(name="Reader")
img, lbl = self.load_data(features)
return img, lbl, features["fname"]
class TFRecordTest(Pipeline):
def __init__(self, batch_size, num_threads, device_id, **kwargs):
super(TFRecordTest, self).__init__(batch_size, num_threads, device_id)
self.input = ops.TFRecordReader(
path=kwargs["tfrecords"],
index_path=kwargs["tfrecords_idx"],
features={
"X_shape": tfrec.FixedLenFeature([4], tfrec.int64, 0),
"X": tfrec.VarLenFeature([], tfrec.float32, 0.0),
"fname": tfrec.FixedLenFeature([], tfrec.string, ""),
},
shard_id=device_id,
num_shards=kwargs["gpus"],
read_ahead=True,
random_shuffle=False,
pad_last_batch=True,
)
def define_graph(self):
features = self.input(name="Reader")
img = fn.reshape(features["X"].gpu(), shape=features["X_shape"], layout="CDHW")
return img, features["fname"]
class TFRecordBenchmark(Pipeline):
def __init__(self, batch_size, num_threads, device_id, **kwargs):
super(TFRecordBenchmark, self).__init__(batch_size, num_threads, device_id)
self.dim = kwargs["dim"]
self.input = ops.TFRecordReader(
path=kwargs["tfrecords"],
index_path=kwargs["tfrecords_idx"],
features={
"X_shape": tfrec.FixedLenFeature([self.dim + 1], tfrec.int64, 0),
"Y_shape": tfrec.FixedLenFeature([self.dim + 1], tfrec.int64, 0),
"X": tfrec.VarLenFeature([], tfrec.float32, 0.0),
"Y": tfrec.FixedLenFeature([], tfrec.string, ""),
"fname": tfrec.FixedLenFeature([], tfrec.string, ""),
},
shard_id=device_id,
num_shards=kwargs["gpus"],
read_ahead=True,
)
self.patch_size = kwargs["patch_size"]
self.layout = "CDHW" if self.dim == 3 else "CHW"
def load_data(self, features):
img = fn.reshape(features["X"].gpu(), shape=features["X_shape"], layout=self.layout)
lbl = fn.reshape(features["Y"].gpu(), shape=features["Y_shape"], layout=self.layout)
lbl = fn.reinterpret(lbl, dtype=types.DALIDataType.UINT8)
return img, lbl
def crop_fn(self, img, lbl):
img = fn.crop(img, crop=self.patch_size)
lbl = fn.crop(lbl, crop=self.patch_size)
return img, lbl
def define_graph(self):
features = self.input(name="Reader")
img, lbl = self.load_data(features)
img, lbl = self.crop_fn(img, lbl)
return img, lbl
class LightningWrapper(DALIGenericIterator):
def __init__(self, pipe, **kwargs):
super().__init__(pipe, **kwargs)
def __next__(self):
out = super().__next__()
out = out[0]
return out
def fetch_dali_loader(tfrecords, idx_files, batch_size, mode, **kwargs):
assert len(tfrecords) > 0, "Got empty tfrecord list"
assert len(idx_files) == len(tfrecords), f"Got {len(idx_files)} index files but {len(tfrecords)} tfrecords"
if kwargs["benchmark"]:
tfrecords = list(itertools.chain(*(20 * [tfrecords])))
idx_files = list(itertools.chain(*(20 * [idx_files])))
pipe_kwargs = {
"tfrecords": tfrecords,
"tfrecords_idx": idx_files,
"gpus": kwargs["gpus"],
"seed": kwargs["seed"],
"patch_size": kwargs["patch_size"],
"dim": kwargs["dim"],
"oversampling": kwargs["oversampling"],
}
if kwargs["benchmark"] and mode == "eval":
pipeline = TFRecordBenchmark
output_map = ["image", "label"]
dynamic_shape = False
elif mode == "training":
pipeline = TFRecordTrain
output_map = ["image", "label"]
dynamic_shape = False
elif mode == "eval":
pipeline = TFRecordEval
output_map = ["image", "label", "fname"]
dynamic_shape = True
else:
pipeline = TFRecordTest
output_map = ["image", "fname"]
dynamic_shape = True
device_id = int(os.getenv("LOCAL_RANK", "0"))
pipe = pipeline(batch_size, kwargs["num_workers"], device_id, **pipe_kwargs)
return LightningWrapper(
pipe,
auto_reset=True,
reader_name="Reader",
output_map=output_map,
dynamic_shape=dynamic_shape,
)


@ -0,0 +1,123 @@
import glob
import os
from subprocess import call
import numpy as np
from joblib import Parallel, delayed
from pytorch_lightning import LightningDataModule
from sklearn.model_selection import KFold
from tqdm import tqdm
from utils.utils import get_config_file, get_task_code, is_main_process, make_empty_dir
from data_loading.dali_loader import fetch_dali_loader
class DataModule(LightningDataModule):
def __init__(self, args):
super().__init__()
self.args = args
self.tfrecords_train = []
self.tfrecords_val = []
self.tfrecords_test = []
self.train_idx = []
self.val_idx = []
self.test_idx = []
self.kfold = KFold(n_splits=self.args.nfolds, shuffle=True, random_state=12345)
self.data_path = os.path.join(self.args.data, get_task_code(self.args))
if self.args.exec_mode == "predict" and not args.benchmark:
self.data_path = os.path.join(self.data_path, "test")
configs = get_config_file(self.args)
self.kwargs = {
"dim": self.args.dim,
"patch_size": configs["patch_size"],
"seed": self.args.seed,
"gpus": self.args.gpus,
"num_workers": self.args.num_workers,
"oversampling": self.args.oversampling,
"create_idx": self.args.create_idx,
"benchmark": self.args.benchmark,
}
def prepare_data(self):
if self.args.create_idx:
tfrecords_train, tfrecords_val, tfrecords_test = self.load_tfrecords()
make_empty_dir("train_idx")
make_empty_dir("val_idx")
make_empty_dir("test_idx")
self.create_idx("train_idx", tfrecords_train)
self.create_idx("val_idx", tfrecords_val)
self.create_idx("test_idx", tfrecords_test)
def setup(self, stage=None):
self.tfrecords_train, self.tfrecords_val, self.tfrecords_test = self.load_tfrecords()
self.train_idx, self.val_idx, self.test_idx = self.load_idx_files()
if is_main_process():
ntrain, nval, ntest = len(self.tfrecords_train), len(self.tfrecords_val), len(self.tfrecords_test)
print(f"Number of examples: Train {ntrain} - Val {nval} - Test {ntest}")
def train_dataloader(self):
return fetch_dali_loader(self.tfrecords_train, self.train_idx, self.args.batch_size, "training", **self.kwargs)
def val_dataloader(self):
return fetch_dali_loader(self.tfrecords_val, self.val_idx, 1, "eval", **self.kwargs)
def test_dataloader(self):
if self.kwargs["benchmark"]:
return fetch_dali_loader(
self.tfrecords_train, self.train_idx, self.args.val_batch_size, "eval", **self.kwargs
)
return fetch_dali_loader(self.tfrecords_test, self.test_idx, 1, "test", **self.kwargs)
def load_tfrecords(self):
if self.args.dim == 2:
train_tfrecords = self.load_data(self.data_path, "*.train_tfrecord")
val_tfrecords = self.load_data(self.data_path, "*.val_tfrecord")
else:
train_tfrecords = self.load_data(self.data_path, "*.tfrecord")
val_tfrecords = self.load_data(self.data_path, "*.tfrecord")
train_idx, val_idx = list(self.kfold.split(train_tfrecords))[self.args.fold]
train_tfrecords = self.get_split(train_tfrecords, train_idx)
val_tfrecords = self.get_split(val_tfrecords, val_idx)
return train_tfrecords, val_tfrecords, self.load_data(os.path.join(self.data_path, "test"), "*.tfrecord")
def load_idx_files(self):
if self.args.create_idx:
test_idx = sorted(glob.glob(os.path.join("test_idx", "*.idx")))
else:
test_idx = self.get_idx_list("test/idx_files", self.tfrecords_test)
if self.args.create_idx:
train_idx = sorted(glob.glob(os.path.join("train_idx", "*.idx")))
val_idx = sorted(glob.glob(os.path.join("val_idx", "*.idx")))
elif self.args.dim == 3:
train_idx = self.get_idx_list("idx_files", self.tfrecords_train)
val_idx = self.get_idx_list("idx_files", self.tfrecords_val)
else:
train_idx = self.get_idx_list("train_idx_files", self.tfrecords_train)
val_idx = self.get_idx_list("val_idx_files", self.tfrecords_val)
return train_idx, val_idx, test_idx
def create_idx(self, idx_dir, tfrecords):
idx_files = [os.path.join(idx_dir, os.path.basename(tfrec).split(".")[0] + ".idx") for tfrec in tfrecords]
Parallel(n_jobs=-1)(
delayed(self.tfrecord2idx)(tfrec, tfidx)
for tfrec, tfidx in tqdm(zip(tfrecords, idx_files), total=len(tfrecords))
)
def get_idx_list(self, dir_name, tfrecords):
root_dir = os.path.join(self.data_path, dir_name)
return sorted([os.path.join(root_dir, os.path.basename(tfr).split(".")[0] + ".idx") for tfr in tfrecords])
@staticmethod
def get_split(data, idx):
return list(np.array(data)[idx])
@staticmethod
def load_data(path, files_pattern):
return sorted(glob.glob(os.path.join(path, files_pattern)))
@staticmethod
def tfrecord2idx(tfrecord, tfidx):
call(["tfrecord2idx", tfrecord, tfidx])


@ -0,0 +1,84 @@
task = {
"01": "Task01_BrainTumour",
"02": "Task02_Heart",
"03": "Task03_Liver",
"04": "Task04_Hippocampus",
"05": "Task05_Prostate",
"06": "Task06_Lung",
"07": "Task07_Pancreas",
"08": "Task08_HepaticVessel",
"09": "Task09_Spleen",
"10": "Task10_Colon",
}
patch_size = {
"01_3d": [128, 128, 128],
"02_3d": [80, 192, 160],
"03_3d": [128, 128, 128],
"04_3d": [40, 56, 40],
"05_3d": [20, 320, 256],
"06_3d": [80, 192, 160],
"07_3d": [40, 224, 224],
"08_3d": [64, 192, 192],
"09_3d": [64, 192, 160],
"10_3d": [56, 192, 160],
"01_2d": [192, 160],
"02_2d": [320, 256],
"03_2d": [512, 512],
"04_2d": [56, 40],
"05_2d": [320, 320],
"06_2d": [512, 512],
"07_2d": [512, 512],
"08_2d": [512, 512],
"09_2d": [512, 512],
"10_2d": [512, 512],
}
spacings = {
"01_3d": [1.0, 1.0, 1.0],
"02_3d": [1.37, 1.25, 1.25],
"03_3d": [1, 0.7676, 0.7676],
"04_3d": [1.0, 1.0, 1.0],
"05_3d": [3.6, 0.62, 0.62],
"06_3d": [1.24, 0.79, 0.79],
"07_3d": [2.5, 0.8, 0.8],
"08_3d": [1.5, 0.8, 0.8],
"09_3d": [1.6, 0.79, 0.79],
"10_3d": [3, 0.78, 0.78],
"11_3d": [5, 0.741, 0.741],
"01_2d": [1.0, 1.0],
"02_2d": [1.25, 1.25],
"03_2d": [0.7676, 0.7676],
"04_2d": [1.0, 1.0],
"05_2d": [0.62, 0.62],
"06_2d": [0.79, 0.79],
"07_2d": [0.8, 0.8],
"08_2d": [0.8, 0.8],
"09_2d": [0.79, 0.79],
"10_2d": [0.78, 0.78],
}
ct_min = {
"03": -17,
"06": -1024,
"07": -96,
"08": -3,
"09": -41,
"10": -30,
"11": -958,
}
ct_max = {
"03": 201,
"06": 325,
"07": 215,
"08": 243,
"09": 176,
"10": 165.82,
"11": 93,
}
ct_mean = {"03": 99.4, "06": -158.58, "07": 77.9, "08": 104.37, "09": 99.29, "10": 62.18, "11": -547.7}
ct_std = {"03": 39.36, "06": 324.7, "07": 75.4, "08": 52.62, "09": 39.47, "10": 32.65, "11": 281.08}


@ -0,0 +1,112 @@
import math
import os
from glob import glob
from subprocess import call
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "3"
import numpy as np
import tensorflow as tf
from joblib import Parallel, delayed
from tqdm import tqdm
from utils.utils import get_task_code, make_empty_dir
class Converter:
def __init__(self, args):
self.args = args
self.mode = self.args.exec_mode
task_code = get_task_code(self.args)
self.data = os.path.join(self.args.data, task_code)
self.results = os.path.join(self.args.results, task_code)
if self.mode == "test":
self.data = os.path.join(self.data, "test")
self.results = os.path.join(self.results, "test")
self.vpf = self.args.vpf
self.imgs = self.load_files("*x.npy")
self.lbls = self.load_files("*y.npy")
def run(self):
print("Saving tfrecords...")
suffix = "tfrecord" if self.args.dim == 3 else "val_tfrecord"
self.save_tfrecords(self.imgs, self.lbls, dim=3, suffix=suffix)
if self.args.dim == 2:
self.save_tfrecords(self.imgs, self.lbls, dim=2, suffix="train_tfrecord")
train_tfrecords, train_idx_dir = self.get_tfrecords_data("*.train_tfrecord", "train_idx_files")
val_tfrecords, val_idx_dir = self.get_tfrecords_data("*.val_tfrecord", "val_idx_files")
print("Saving idx files...")
self.create_idx_files(train_tfrecords, train_idx_dir)
self.create_idx_files(val_tfrecords, val_idx_dir)
else:
tfrecords, idx_dir = self.get_tfrecords_data("*.tfrecord", "idx_files")
print("Saving idx files...")
self.create_idx_files(tfrecords, idx_dir)
def save_tfrecords(self, imgs, lbls, dim, suffix):
if len(lbls) == 0:
lbls = imgs[:]
chunks = np.array_split(list(zip(imgs, lbls)), math.ceil(len(imgs) / self.args.vpf))
Parallel(n_jobs=self.args.n_jobs)(
delayed(self.convert2tfrec)(chunk, dim, suffix) for chunk in tqdm(chunks, total=len(chunks))
)
def convert2tfrec(self, files, dim, suffix):
examples = []
for img_path, lbl_path in files:
img, lbl = np.load(img_path), np.load(lbl_path)
if dim == 2:
for depth in range(img.shape[1]):
examples.append(self.create_example(img[:, depth], lbl[:, depth], os.path.basename(img_path)))
else:
examples.append(self.create_example(img, lbl, os.path.basename(img_path)))
fname = os.path.basename(files[0][0]).replace("_x.npy", "")
tfrecord_name = os.path.join(self.results, f"{fname}.{suffix}")
with tf.io.TFRecordWriter(tfrecord_name) as writer:
for example in examples:
writer.write(example.SerializeToString())
def create_idx_files(self, tfrecords, save_dir):
make_empty_dir(save_dir)
tfrecords_idx = []
for tfrec in tfrecords:
fname = os.path.basename(tfrec).split(".")[0]
tfrecords_idx.append(os.path.join(save_dir, f"{fname}.idx"))
Parallel(n_jobs=self.args.n_jobs)(
delayed(self.create_idx)(tr, ti) for tr, ti in tqdm(zip(tfrecords, tfrecords_idx), total=len(tfrecords))
)
def create_example(self, img, lbl, fname):
def _float_feature(value):
return tf.train.Feature(float_list=tf.train.FloatList(value=value))
def _int64_feature(value):
return tf.train.Feature(int64_list=tf.train.Int64List(value=value))
def _bytes_feature(value):
return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))
feature = {
"X": _float_feature(img.flatten()),
"X_shape": _int64_feature(img.shape),
"fname": _bytes_feature(str.encode(fname)),
}
if self.mode == "training":
feature.update({"Y": _bytes_feature(lbl.flatten().tobytes()), "Y_shape": _int64_feature(lbl.shape)})
return tf.train.Example(features=tf.train.Features(feature=feature))
@staticmethod
def create_idx(tfrecord, tfidx):
call(["tfrecord2idx", tfrecord, tfidx])
def load_files(self, suffix):
return sorted(glob(os.path.join(self.data, suffix)))
def get_tfrecords_data(self, tfrec_pattern, idx_dir):
tfrecords = self.load_files(os.path.join(self.results, tfrec_pattern))
tfrecords_dir = os.path.join(self.results, idx_dir)
return tfrecords, tfrecords_dir
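For a quick sanity check, a shard written by this converter can be parsed back with plain TensorFlow. A minimal sketch for the 3D layout, where X is a variable-length float list and X_shape holds the four (C, D, H, W) extents (the shard name is hypothetical):

```python
import tensorflow as tf

def parse_record(raw):
    feature_spec = {
        "X": tf.io.VarLenFeature(tf.float32),
        "X_shape": tf.io.FixedLenFeature([4], tf.int64),
        "fname": tf.io.FixedLenFeature([], tf.string),
    }
    parsed = tf.io.parse_single_example(raw, feature_spec)
    img = tf.reshape(tf.sparse.to_dense(parsed["X"]), parsed["X_shape"])
    return img, parsed["fname"]

dataset = tf.data.TFRecordDataset("case_0000.tfrecord").map(parse_record)
for img, fname in dataset.take(1):
    print(fname.numpy(), img.shape)
```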

View file

@@ -0,0 +1,277 @@
import itertools
import json
import math
import os
import pickle
import monai.transforms as transforms
import nibabel
import numpy as np
from joblib import Parallel, delayed
from skimage.transform import resize
from utils.utils import get_task_code, make_empty_dir
from data_preprocessing.configs import ct_max, ct_mean, ct_min, ct_std, patch_size, spacings, task
class Preprocessor:
def __init__(self, args):
self.args = args
self.ct_min = 0
self.ct_max = 0
self.ct_mean = 0
self.ct_std = 0
self.target_spacing = None
self.task = args.task
self.task_code = get_task_code(args)
self.patch_size = patch_size[self.task_code]
self.training = args.exec_mode == "training"
self.data_path = os.path.join(args.data, task[args.task])
self.results = os.path.join(args.results, self.task_code)
if not self.training:
self.results = os.path.join(self.results, "test")
self.metadata = json.load(open(os.path.join(self.data_path, "dataset.json"), "r"))
self.modality = self.metadata["modality"]["0"]
self.crop_foreg = transforms.CropForegroundd(keys=["image", "label"], source_key="image")
self.normalize_intensity = transforms.NormalizeIntensity(nonzero=True, channel_wise=True)
def run(self):
make_empty_dir(self.results)
print(f"Preprocessing {self.data_path}")
try:
self.target_spacing = spacings[self.task_code]
        except KeyError:  # spacing not predefined for this task; derive it from the data
self.collect_spacings()
print(f"Target spacing {self.target_spacing}")
if self.modality == "CT":
try:
self.ct_min = ct_min[self.task]
self.ct_max = ct_max[self.task]
self.ct_mean = ct_mean[self.task]
self.ct_std = ct_std[self.task]
            except KeyError:  # intensity stats not predefined; collect them from label foregrounds
self.collect_intensities()
_mean = round(self.ct_mean, 2)
_std = round(self.ct_std, 2)
print(f"[CT] min: {self.ct_min}, max: {self.ct_max}, mean: {_mean}, std: {_std}")
self.run_parallel(self.preprocess_pair, self.args.exec_mode)
pickle.dump(
{
"patch_size": self.patch_size,
"spacings": self.target_spacing,
"n_class": len(self.metadata["labels"]),
"in_channels": len(self.metadata["modality"]),
},
open(os.path.join(self.results, "config.pkl"), "wb"),
)
def preprocess_pair(self, pair):
fname = os.path.basename(pair["image"] if self.training else pair)
image, label, image_spacings = self.load_pair(pair)
if self.training:
data = self.crop_foreg({"image": image, "label": label})
image, label = data["image"], data["label"]
if self.args.dim == 3:
image, label = self.resample(image, label, image_spacings)
if self.modality == "CT":
image = np.clip(image, self.ct_min, self.ct_max)
if self.training:
image, label = self.standardize(image, label)
image = self.normalize(image)
self.save(image, label, fname)
def resample(self, image, label, image_spacings):
if self.target_spacing != image_spacings:
image, label = self.resample_pair(image, label, image_spacings)
return image, label
def standardize(self, image, label):
pad_shape = self.calculate_pad_shape(image)
img_shape = image.shape[1:]
if pad_shape != img_shape:
paddings = [(pad_sh - img_sh) / 2 for (pad_sh, img_sh) in zip(pad_shape, img_shape)]
image = self.pad(image, paddings)
label = self.pad(label, paddings)
if self.args.dim == 2: # Center cropping 2D images.
            _, _, height, width = image.shape
            start_h = (height - self.patch_size[0]) // 2
            start_w = (width - self.patch_size[1]) // 2
image = image[:, :, start_h : start_h + self.patch_size[0], start_w : start_w + self.patch_size[1]]
label = label[:, :, start_h : start_h + self.patch_size[0], start_w : start_w + self.patch_size[1]]
return image, label
def normalize(self, image):
if self.modality == "CT":
return (image - self.ct_mean) / self.ct_std
return self.normalize_intensity(image)
def save(self, image, label, fname):
mean, std = np.round(np.mean(image, (1, 2, 3)), 2), np.round(np.std(image, (1, 2, 3)), 2)
print(f"Saving {fname} shape {image.shape} mean {mean} std {std}")
self.save_3d(image, label, fname)
def load_pair(self, pair):
image = self.load_nifty(pair["image"] if self.training else pair)
image_spacing = self.load_spacing(image)
image = image.get_fdata().astype(np.float32)
image = self.standardize_layout(image)
label = None
if self.training:
label = self.load_nifty(pair["label"]).get_fdata().astype(np.uint8)
label = self.standardize_layout(label)
return image, label, image_spacing
def resample_pair(self, image, label, spacing):
shape = self.calculate_new_shape(spacing, image.shape[1:])
        if self.check_anisotropy(spacing):
            image = self.resample_anisotropic_image(image, shape)
            if self.training:
                label = self.resample_anisotropic_label(label, shape)
else:
image = self.resample_regular_image(image, shape)
if self.training:
label = self.resample_regular_label(label, shape)
image = image.astype(np.float32)
if self.training:
label = label.astype(np.uint8)
return image, label
def calculate_pad_shape(self, image):
min_shape = self.patch_size[:]
img_shape = image.shape[1:]
if len(min_shape) == 2: # In 2D case we don't want to pad depth axis.
min_shape.insert(0, img_shape[0])
pad_shape = [max(mshape, ishape) for mshape, ishape in zip(min_shape, img_shape)]
return pad_shape
def get_intensities(self, pair):
image = self.load_nifty(pair["image"]).get_fdata().astype(np.float32)
label = self.load_nifty(pair["label"]).get_fdata().astype(np.uint8)
foreground_idx = np.where(label > 0)
intensities = image[foreground_idx].tolist()
return intensities
def collect_intensities(self):
intensities = self.run_parallel(self.get_intensities, "training")
intensities = list(itertools.chain(*intensities))
self.ct_min, self.ct_max = np.percentile(intensities, [0.5, 99.5])
self.ct_mean, self.ct_std = np.mean(intensities), np.std(intensities)
def get_spacing(self, pair):
image = nibabel.load(os.path.join(self.data_path, pair["image"]))
spacing = self.load_spacing(image)
return spacing
def collect_spacings(self):
spacing = self.run_parallel(self.get_spacing, "training")
spacing = np.array(spacing)
target_spacing = np.median(spacing, axis=0)
if max(target_spacing) / min(target_spacing) >= 3:
lowres_axis = np.argmin(target_spacing)
target_spacing[lowres_axis] = np.percentile(spacing[:, lowres_axis], 10)
self.target_spacing = list(target_spacing)
    def check_anisotropy(self, spacing):
def check(spacing):
return np.max(spacing) / np.min(spacing) >= 3
return check(spacing) or check(self.target_spacing)
def calculate_new_shape(self, spacing, shape):
spacing_ratio = np.array(spacing) / np.array(self.target_spacing)
new_shape = (spacing_ratio * np.array(shape)).astype(int).tolist()
return new_shape
def save_3d(self, image, label, fname):
self.save_npy(image, fname, "_x.npy")
if self.training:
self.save_npy(label, fname, "_y.npy")
def save_npy(self, img, fname, suffix):
np.save(os.path.join(self.results, fname.replace(".nii.gz", suffix)), img, allow_pickle=False)
def run_parallel(self, func, exec_mode):
return Parallel(n_jobs=self.args.n_jobs)(delayed(func)(pair) for pair in self.metadata[exec_mode])
def load_nifty(self, fname):
return nibabel.load(os.path.join(self.data_path, fname))
@staticmethod
def load_spacing(image):
return image.header["pixdim"][1:4].tolist()[::-1]
@staticmethod
def pad(image, padding):
pad_d, pad_w, pad_h = padding
return np.pad(
image,
(
(0, 0),
(math.floor(pad_d), math.ceil(pad_d)),
(math.floor(pad_w), math.ceil(pad_w)),
(math.floor(pad_h), math.ceil(pad_h)),
),
)
@staticmethod
def standardize_layout(data):
if len(data.shape) == 3:
data = np.expand_dims(data, 3)
return np.transpose(data, (3, 2, 1, 0))
@staticmethod
def resize_fn(image, shape, order, mode):
return resize(image, shape, order=order, mode=mode, cval=0, clip=True, anti_aliasing=False)
    def resample_anisotropic_image(self, image, shape):
resized_channels = []
for image_c in image:
resized = [self.resize_fn(i, shape[1:], 3, "edge") for i in image_c]
resized = np.stack(resized, axis=0)
resized = self.resize_fn(resized, shape, 0, "constant")
resized_channels.append(resized)
resized = np.stack(resized_channels, axis=0)
return resized
def resample_regular_image(self, image, shape):
resized_channels = []
for image_c in image:
resized_channels.append(self.resize_fn(image_c, shape, 3, "edge"))
resized = np.stack(resized_channels, axis=0)
return resized
    def resample_anisotropic_label(self, label, shape):
depth = label.shape[1]
reshaped = np.zeros(shape, dtype=np.uint8)
shape_2d = shape[1:]
reshaped_2d = np.zeros((depth, *shape_2d), dtype=np.uint8)
n_class = np.max(label)
for class_ in range(1, n_class + 1):
for depth_ in range(depth):
mask = label[0, depth_] == class_
resized_2d = self.resize_fn(mask.astype(float), shape_2d, 1, "edge")
reshaped_2d[depth_][resized_2d >= 0.5] = class_
for class_ in range(1, n_class + 1):
mask = reshaped_2d == class_
resized = self.resize_fn(mask.astype(float), shape, 0, "constant")
reshaped[resized >= 0.5] = class_
reshaped = np.expand_dims(reshaped, 0)
return reshaped
def resample_regular_label(self, label, shape):
reshaped = np.zeros(shape, dtype=np.uint8)
n_class = np.max(label)
for class_ in range(1, n_class + 1):
mask = label[0] == class_
resized = self.resize_fn(mask.astype(float), shape, 1, "edge")
reshaped[resized >= 0.5] = class_
reshaped = np.expand_dims(reshaped, 0)
return reshaped
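The resampling target shape is pure spacing-ratio arithmetic. A worked example, assuming a volume stored at [3.0, 0.5, 0.5] mm spacing and a [1.0, 1.0, 1.0] mm target:

```python
import numpy as np

spacing = np.array([3.0, 0.5, 0.5])         # source voxel spacing (D, H, W)
target_spacing = np.array([1.0, 1.0, 1.0])  # desired voxel spacing
shape = np.array([40, 320, 320])            # source voxel counts

# Same formula as Preprocessor.calculate_new_shape:
new_shape = (spacing / target_spacing * shape).astype(int).tolist()
print(new_shape)  # [120, 160, 160] -- depth upsampled 3x, in-plane halved

# max(spacing) / min(spacing) == 6 >= 3, so check_anisotropy would route this
# volume through the slice-wise (anisotropic) resampling path above.
```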

View file

@@ -0,0 +1,17 @@
import os
from argparse import ArgumentDefaultsHelpFormatter, ArgumentParser
from subprocess import call
from data_preprocessing.configs import task
parser = ArgumentParser(formatter_class=ArgumentDefaultsHelpFormatter)
parser.add_argument("--task", type=str, required=True, help="Task to download")
parser.add_argument("--results", type=str, default="/data", help="Directory for data storage")
if __name__ == "__main__":
args = parser.parse_args()
tar_file = task[args.task] + ".tar"
file_path = os.path.join(args.results, tar_file)
call(f"aws s3 cp s3://msd-for-monai-eu/{tar_file} --no-sign-request {args.results}", shell=True)
call(f"tar -xf {file_path} -C {args.results}", shell=True)
call(f"rm -rf {file_path}", shell=True)

Binary file not shown.


View file

@@ -0,0 +1,135 @@
import os
import pyprof
import torch
from dllogger import JSONStreamBackend, Logger, StdOutBackend, Verbosity
from pytorch_lightning import Trainer, seed_everything
from pytorch_lightning.callbacks import EarlyStopping, ModelCheckpoint
from data_loading.data_module import DataModule
from models.nn_unet import NNUnet
from utils.gpu_affinity import set_affinity
from utils.logger import LoggingCallback
from utils.utils import get_main_args, is_main_process, make_empty_dir, set_cuda_devices, verify_ckpt_path
def log(logname, dice, epoch=None, dice_tta=None):
dllogger = Logger(
backends=[
JSONStreamBackend(Verbosity.VERBOSE, os.path.join(args.results, logname)),
StdOutBackend(Verbosity.VERBOSE, step_format=lambda step: ""),
]
)
metrics = {}
if epoch is not None:
metrics.update({"Epoch": epoch})
metrics.update({"Mean dice": round(dice.mean().item(), 2)})
if dice_tta is not None:
metrics.update({"Mean TTA dice": round(dice_tta.mean().item(), 2)})
metrics.update({f"L{j+1}": round(m.item(), 2) for j, m in enumerate(dice)})
if dice_tta is not None:
metrics.update({f"TTA_L{j+1}": round(m.item(), 2) for j, m in enumerate(dice_tta)})
dllogger.log(step=(), data=metrics)
dllogger.flush()
if __name__ == "__main__":
args = get_main_args()
if args.profile:
pyprof.init(enable_function_stack=True)
print("Profiling enabled")
if args.affinity != "disabled":
        affinity = set_affinity(int(os.getenv("LOCAL_RANK", "0")), args.affinity)
set_cuda_devices(args)
if is_main_process():
print(f"{args.exec_mode.upper()} TASK {args.task} FOLD {args.fold} SEED {args.seed}")
seed_everything(args.seed)
data_module = DataModule(args)
data_module.prepare_data()
data_module.setup()
ckpt_path = verify_ckpt_path(args)
callbacks = None
model_ckpt = None
if args.benchmark:
model = NNUnet(args)
batch_size = args.batch_size if args.exec_mode == "train" else args.val_batch_size
log_dir = os.path.join(args.results, args.logname if args.logname is not None else "perf.json")
callbacks = [
LoggingCallback(
log_dir=log_dir,
global_batch_size=batch_size * args.gpus,
mode=args.exec_mode,
warmup=args.warmup,
dim=args.dim,
profile=args.profile,
)
]
elif args.exec_mode == "train":
model = NNUnet(args)
if args.save_ckpt:
model_ckpt = ModelCheckpoint(monitor="dice_sum", mode="max", save_last=True)
callbacks = [EarlyStopping(monitor="dice_sum", patience=args.patience, verbose=True, mode="max")]
else: # Evaluation or inference
if ckpt_path is not None:
model = NNUnet.load_from_checkpoint(ckpt_path)
else:
model = NNUnet(args)
trainer = Trainer(
logger=False,
gpus=args.gpus,
precision=16 if args.amp else 32,
benchmark=True,
deterministic=False,
min_epochs=args.min_epochs,
max_epochs=args.max_epochs,
sync_batchnorm=args.sync_batchnorm,
gradient_clip_val=args.gradient_clip_val,
callbacks=callbacks,
num_sanity_val_steps=0,
default_root_dir=args.results,
resume_from_checkpoint=ckpt_path,
accelerator="ddp" if args.gpus > 1 else None,
checkpoint_callback=model_ckpt,
limit_train_batches=1.0 if args.train_batches == 0 else args.train_batches,
limit_val_batches=1.0 if args.test_batches == 0 else args.test_batches,
limit_test_batches=1.0 if args.test_batches == 0 else args.test_batches,
)
if args.benchmark:
if args.exec_mode == "train":
if args.profile:
with torch.autograd.profiler.emit_nvtx():
trainer.fit(model, train_dataloader=data_module.train_dataloader())
else:
trainer.fit(model, train_dataloader=data_module.train_dataloader())
else:
trainer.test(model, test_dataloaders=data_module.test_dataloader())
elif args.exec_mode == "train":
trainer.fit(model, data_module)
if model_ckpt is not None:
model.args.exec_mode = "evaluate"
model.args.tta = True
trainer.interrupted = False
trainer.test(test_dataloaders=data_module.val_dataloader())
if is_main_process():
log_name = args.logname if args.logname is not None else "train_log.json"
log(log_name, model.best_sum_dice, model.best_sum_epoch, model.eval_dice)
elif args.exec_mode == "evaluate":
model.args = args
trainer.test(model, test_dataloaders=data_module.val_dataloader())
if is_main_process():
log(args.logname if args.logname is not None else "eval_log.json", model.eval_dice)
elif args.exec_mode == "predict":
model.args = args
if args.save_preds:
dir_name = f"preds_task_{args.task}_dim_{args.dim}_fold_{args.fold}"
if args.tta:
dir_name += "_tta"
save_dir = os.path.join(args.results, dir_name)
model.save_dir = save_dir
make_empty_dir(save_dir)
trainer.test(model, test_dataloaders=data_module.test_dataloader())

View file

@@ -0,0 +1,98 @@
import numpy as np
import torch
import torch.nn as nn
normalizations = {
"instancenorm3d": nn.InstanceNorm3d,
"instancenorm2d": nn.InstanceNorm2d,
"batchnorm3d": nn.BatchNorm3d,
"batchnorm2d": nn.BatchNorm2d,
}
convolutions = {
"Conv2d": nn.Conv2d,
"Conv3d": nn.Conv3d,
"ConvTranspose2d": nn.ConvTranspose2d,
"ConvTranspose3d": nn.ConvTranspose3d,
}
def get_norm(name, out_channels):
if "groupnorm" in name:
return nn.GroupNorm(32, out_channels, affine=True)
return normalizations[name](out_channels, affine=True)
def get_conv(in_channels, out_channels, kernel_size, stride, dim, bias=False):
conv = convolutions[f"Conv{dim}d"]
padding = get_padding(kernel_size, stride)
return conv(in_channels, out_channels, kernel_size, stride, padding, bias=bias)
def get_transp_conv(in_channels, out_channels, kernel_size, stride, dim):
conv = convolutions[f"ConvTranspose{dim}d"]
padding = get_padding(kernel_size, stride)
output_padding = get_output_padding(kernel_size, stride, padding)
return conv(in_channels, out_channels, kernel_size, stride, padding, output_padding, bias=True)
def get_padding(kernel_size, stride):
kernel_size_np = np.atleast_1d(kernel_size)
stride_np = np.atleast_1d(stride)
padding_np = (kernel_size_np - stride_np + 1) / 2
padding = tuple(int(p) for p in padding_np)
return padding if len(padding) > 1 else padding[0]
def get_output_padding(kernel_size, stride, padding):
kernel_size_np = np.atleast_1d(kernel_size)
stride_np = np.atleast_1d(stride)
padding_np = np.atleast_1d(padding)
out_padding_np = 2 * padding_np + stride_np - kernel_size_np
out_padding = tuple(int(p) for p in out_padding_np)
return out_padding if len(out_padding) > 1 else out_padding[0]
class ConvLayer(nn.Module):
def __init__(self, in_channels, out_channels, kernel_size, stride, norm, negative_slope, dim):
super(ConvLayer, self).__init__()
self.conv = get_conv(in_channels, out_channels, kernel_size, stride, dim)
self.norm = get_norm(norm, out_channels)
self.lrelu = nn.LeakyReLU(negative_slope=negative_slope, inplace=True)
def forward(self, input_data):
return self.lrelu(self.norm(self.conv(input_data)))
class ConvBlock(nn.Module):
def __init__(self, in_channels, out_channels, kernel_size, stride, norm, negative_slope, dim):
super(ConvBlock, self).__init__()
self.conv1 = ConvLayer(in_channels, out_channels, kernel_size, stride, norm, negative_slope, dim)
self.conv2 = ConvLayer(out_channels, out_channels, kernel_size, 1, norm, negative_slope, dim)
def forward(self, input_data):
out = self.conv1(input_data)
out = self.conv2(out)
return out
class UpsampleBlock(nn.Module):
def __init__(self, in_channels, out_channels, kernel_size, stride, norm, negative_slope, dim):
super(UpsampleBlock, self).__init__()
self.transp_conv = get_transp_conv(in_channels, out_channels, stride, stride, dim)
self.conv_block = ConvBlock(2 * out_channels, out_channels, kernel_size, 1, norm, negative_slope, dim)
def forward(self, input_data, skip_data):
out = self.transp_conv(input_data)
out = torch.cat((out, skip_data), dim=1)
out = self.conv_block(out)
return out
class OutputBlock(nn.Module):
def __init__(self, in_channels, out_channels, dim):
super(OutputBlock, self).__init__()
self.conv = get_conv(in_channels, out_channels, kernel_size=1, stride=1, dim=dim, bias=True)
def forward(self, input_data):
return self.conv(input_data)
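get_padding and get_output_padding are chosen so a stride-s transposed convolution exactly inverts the spatial downsampling of a stride-s convolution. A small shape check, assuming this repo's layout:

```python
import torch

from models.layers import get_conv, get_transp_conv  # assuming this repo's layout

down = get_conv(8, 16, kernel_size=3, stride=2, dim=3)       # padding resolves to 1
up = get_transp_conv(16, 8, kernel_size=2, stride=2, dim=3)  # output_padding resolves to 0

x = torch.randn(1, 8, 32, 32, 32)
y = down(x)
z = up(y)
print(y.shape)  # torch.Size([1, 16, 16, 16, 16])
print(z.shape)  # torch.Size([1, 8, 32, 32, 32]) -- back to the input resolution
```

Note that UpsampleBlock calls get_transp_conv with kernel_size equal to stride, which is the case exercised here.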

View file

@@ -0,0 +1,50 @@
import monai
import torch
import torch.nn as nn
from pytorch_lightning.metrics.functional import stat_scores
from pytorch_lightning.metrics.metric import Metric
class Dice(Metric):
def __init__(self, nclass):
super().__init__(dist_sync_on_step=True)
self.add_state("dice", default=torch.zeros((nclass,)), dist_reduce_fx="mean")
def update(self, pred, target):
self.dice = self.compute_stats(pred, target)
def compute(self):
return self.dice
@staticmethod
def compute_stats(pred, target):
num_classes = pred.shape[1]
_bg = 1
scores = torch.zeros(num_classes - _bg, device=pred.device, dtype=torch.float32)
precision = torch.zeros(num_classes - _bg, device=pred.device, dtype=torch.float32)
recall = torch.zeros(num_classes - _bg, device=pred.device, dtype=torch.float32)
for i in range(_bg, num_classes):
if not (target == i).any():
# no foreground class
_, _pred = torch.max(pred, 1)
scores[i - _bg] += 1 if not (_pred == i).any() else 0
recall[i - _bg] += 1 if not (_pred == i).any() else 0
precision[i - _bg] += 1 if not (_pred == i).any() else 0
continue
_tp, _fp, _tn, _fn, _ = stat_scores(pred=pred, target=target, class_index=i)
denom = (2 * _tp + _fp + _fn).to(torch.float)
score_cls = (2 * _tp).to(torch.float) / denom if torch.is_nonzero(denom) else 0.0
scores[i - _bg] += score_cls
return scores
class Loss(nn.Module):
def __init__(self):
super(Loss, self).__init__()
self.dice = monai.losses.DiceLoss(to_onehot_y=True, softmax=True, batch=True)
self.cross_entropy = nn.CrossEntropyLoss()
def forward(self, y_pred, y_true):
dice = self.dice(y_pred, y_true)
cross_entropy = self.cross_entropy(y_pred, y_true[:, 0].long())
return (dice + cross_entropy) / 2
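The combined objective is the mean of MONAI's DiceLoss (softmax over channels, targets one-hot encoded internally) and voxel-wise cross entropy. A shape-level sketch of a call, assuming this repo's layout:

```python
import torch

from models.metrics import Loss  # assuming this repo's layout

loss_fn = Loss()
logits = torch.randn(2, 3, 32, 32, 32)                    # (N, C, D, H, W) raw scores
target = torch.randint(0, 3, (2, 1, 32, 32, 32)).float()  # (N, 1, D, H, W) labels

# DiceLoss one-hots the target internally; CrossEntropyLoss receives target[:, 0].long().
print(loss_fn(logits, target))
```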

View file

@@ -0,0 +1,241 @@
import os
import apex
import numpy as np
import pytorch_lightning as pl
import torch
import torch.nn as nn
import torch_optimizer as optim
from dllogger import JSONStreamBackend, Logger, StdOutBackend, Verbosity
from monai.inferers import sliding_window_inference
from utils.utils import flip, get_config_file, is_main_process
from models.metrics import Dice, Loss
from models.unet import UNet
class NNUnet(pl.LightningModule):
def __init__(self, args):
super(NNUnet, self).__init__()
self.args = args
self.save_hyperparameters()
self.build_nnunet()
self.loss = Loss()
self.dice = Dice(self.n_class)
self.best_sum = 0
self.eval_dice = 0
self.best_sum_epoch = 0
self.best_dice = self.n_class * [0]
self.best_epoch = self.n_class * [0]
self.best_sum_dice = self.n_class * [0]
self.learning_rate = args.learning_rate
if self.args.exec_mode in ["train", "evaluate"]:
self.dllogger = Logger(
backends=[
JSONStreamBackend(Verbosity.VERBOSE, os.path.join(args.results, "logs.json")),
StdOutBackend(Verbosity.VERBOSE, step_format=lambda step: f"Epoch: {step} "),
]
)
self.tta_flips = (
[[2], [3], [2, 3]] if self.args.dim == 2 else [[2], [3], [4], [2, 3], [2, 4], [3, 4], [2, 3, 4]]
)
def forward(self, img):
if self.args.benchmark:
return self.model(img)
return self.tta_inference(img) if self.args.tta else self.do_inference(img)
def training_step(self, batch, batch_idx):
img, lbl = batch["image"], batch["label"]
pred = self.model(img)
loss = self.compute_loss(pred, lbl)
return loss
def validation_step(self, batch, batch_idx):
img, lbl = batch["image"], batch["label"]
pred = self.forward(img)
loss = self.loss(pred, lbl)
dice = self.dice(pred, lbl[:, 0])
return {"val_loss": loss, "val_dice": dice}
def test_step(self, batch, batch_idx):
if self.args.exec_mode == "evaluate":
return self.validation_step(batch, batch_idx)
img = batch["image"]
pred = self.forward(img)
if self.args.save_preds:
self.save_mask(pred, batch["fname"])
def build_unet(self, in_channels, n_class, kernels, strides):
return UNet(
in_channels=in_channels,
n_class=n_class,
kernels=kernels,
strides=strides,
normalization_layer=self.args.norm,
negative_slope=self.args.negative_slope,
deep_supervision=self.args.deep_supervision,
dimension=self.args.dim,
)
def get_unet_params(self):
config = get_config_file(self.args)
in_channels = config["in_channels"]
patch_size = config["patch_size"]
spacings = config["spacings"]
n_class = config["n_class"]
strides, kernels, sizes = [], [], patch_size[:]
while True:
spacing_ratio = [spacing / min(spacings) for spacing in spacings]
stride = [2 if ratio <= 2 and size >= 8 else 1 for (ratio, size) in zip(spacing_ratio, sizes)]
kernel = [3 if ratio <= 2 else 1 for ratio in spacing_ratio]
if all(s == 1 for s in stride):
break
sizes = [i / j for i, j in zip(sizes, stride)]
spacings = [i * j for i, j in zip(spacings, stride)]
kernels.append(kernel)
strides.append(stride)
if len(strides) == 5:
break
strides.insert(0, len(spacings) * [1])
kernels.append(len(spacings) * [3])
return in_channels, n_class, kernels, strides, patch_size
def build_nnunet(self):
in_channels, n_class, kernels, strides, self.patch_size = self.get_unet_params()
self.model = self.build_unet(in_channels, n_class, kernels, strides)
self.n_class = n_class - 1
if is_main_process():
print(f"Filters: {self.model.filters}")
print(f"Kernels: {kernels}")
print(f"Strides: {strides}")
def compute_loss(self, preds, label):
if self.args.deep_supervision:
loss = self.loss(preds[0], label)
for i, pred in enumerate(preds[1:]):
downsampled_label = nn.functional.interpolate(label, pred.shape[2:])
loss += 0.5 ** (i + 1) * self.loss(pred, downsampled_label)
c_norm = 1 / (2 - 2 ** (-len(preds)))
return c_norm * loss
return self.loss(preds, label)
def do_inference(self, image):
if self.args.dim == 2:
if self.args.exec_mode == "predict" and not self.args.benchmark:
return self.inference2d_test(image)
return self.inference2d(image)
return self.sliding_window_inference(image)
def tta_inference(self, img):
pred = self.do_inference(img)
for flip_idx in self.tta_flips:
pred += flip(self.do_inference(flip(img, flip_idx)), flip_idx)
pred /= len(self.tta_flips) + 1
return pred
def inference2d(self, image):
batch_modulo = image.shape[2] % self.args.val_batch_size
if self.args.benchmark:
image = image[:, :, batch_modulo:]
elif batch_modulo != 0:
batch_pad = self.args.val_batch_size - batch_modulo
image = nn.ConstantPad3d((0, 0, 0, 0, batch_pad, 0), 0)(image)
image = torch.transpose(image.squeeze(0), 0, 1)
preds_shape = (image.shape[0], self.n_class + 1, *image.shape[2:])
preds = torch.zeros(preds_shape, dtype=image.dtype, device=image.device)
for start in range(0, image.shape[0] - self.args.val_batch_size + 1, self.args.val_batch_size):
end = start + self.args.val_batch_size
pred = self.model(image[start:end])
preds[start:end] = pred.data
if batch_modulo != 0 and not self.args.benchmark:
preds = preds[batch_pad:]
return torch.transpose(preds, 0, 1).unsqueeze(0)
def inference2d_test(self, image):
preds_shape = (image.shape[0], self.n_class + 1, *image.shape[2:])
preds = torch.zeros(preds_shape, dtype=image.dtype, device=image.device)
for depth in range(image.shape[2]):
preds[:, :, depth] = self.sliding_window_inference(image[:, :, depth])
return preds
def sliding_window_inference(self, image):
return sliding_window_inference(
inputs=image,
roi_size=self.patch_size,
sw_batch_size=self.args.val_batch_size,
predictor=self.model,
overlap=self.args.overlap,
mode=self.args.val_mode,
)
@staticmethod
def metric_mean(name, outputs):
return torch.stack([out[name] for out in outputs]).mean(dim=0)
def validation_epoch_end(self, outputs):
loss = self.metric_mean("val_loss", outputs)
dice = 100 * self.metric_mean("val_dice", outputs)
dice_sum = torch.sum(dice)
if dice_sum >= self.best_sum:
self.best_sum = dice_sum
self.best_sum_dice = dice[:]
self.best_sum_epoch = self.current_epoch
for i, dice_i in enumerate(dice):
if dice_i > self.best_dice[i]:
self.best_dice[i], self.best_epoch[i] = dice_i, self.current_epoch
if is_main_process():
metrics = {}
metrics.update({"mean dice": round(torch.mean(dice).item(), 2)})
metrics.update({"TOP_mean": round(torch.mean(self.best_sum_dice).item(), 2)})
metrics.update({f"L{i+1}": round(m.item(), 2) for i, m in enumerate(dice)})
metrics.update({f"TOP_L{i+1}": round(m.item(), 2) for i, m in enumerate(self.best_sum_dice)})
metrics.update({"val_loss": round(loss.item(), 4)})
self.dllogger.log(step=self.current_epoch, data=metrics)
self.dllogger.flush()
self.log("val_loss", loss)
self.log("dice_sum", dice_sum)
def test_epoch_end(self, outputs):
if self.args.exec_mode == "evaluate":
self.eval_dice = 100 * self.metric_mean("val_dice", outputs)
def configure_optimizers(self):
optimizer = {
"sgd": torch.optim.SGD(self.parameters(), lr=self.learning_rate, momentum=self.args.momentum),
"adam": torch.optim.Adam(self.parameters(), lr=self.learning_rate, weight_decay=self.args.weight_decay),
"adamw": torch.optim.AdamW(self.parameters(), lr=self.learning_rate, weight_decay=self.args.weight_decay),
"radam": optim.RAdam(self.parameters(), lr=self.learning_rate, weight_decay=self.args.weight_decay),
"fused_adam": apex.optimizers.FusedAdam(
self.parameters(), lr=self.learning_rate, weight_decay=self.args.weight_decay
),
}[self.args.optimizer.lower()]
scheduler = {
"none": None,
"multistep": torch.optim.lr_scheduler.MultiStepLR(optimizer, self.args.steps, gamma=self.args.factor),
"cosine": torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, self.args.max_epochs),
"plateau": torch.optim.lr_scheduler.ReduceLROnPlateau(
optimizer, factor=self.args.factor, patience=self.args.lr_patience
),
}[self.args.scheduler.lower()]
opt_dict = {"optimizer": optimizer, "monitor": "val_loss"}
if scheduler is not None:
opt_dict.update({"lr_scheduler": scheduler})
return opt_dict
def save_mask(self, pred, fname):
fname = str(fname[0].cpu().detach().numpy(), "utf-8").replace("_x", "_pred")
        pred = nn.functional.softmax(pred, dim=1)
pred = pred.squeeze().cpu().detach().numpy()
np.save(os.path.join(self.save_dir, fname), pred, allow_pickle=False)
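In compute_loss the i-th auxiliary output is weighted by 0.5^(i+1) and c_norm rescales the weighted sum; note that with 2^(-len(preds)) in the denominator the effective weights sum to just under one rather than exactly one. A small trace for a main output plus three deep-supervision heads:

```python
# Effective per-head weights used by compute_loss for len(preds) == 4.
n = 4
weights = [1.0] + [0.5 ** (i + 1) for i in range(n - 1)]  # [1.0, 0.5, 0.25, 0.125]
c_norm = 1 / (2 - 2 ** (-n))
print([round(c_norm * w, 4) for w in weights])  # [0.5161, 0.2581, 0.129, 0.0645]
print(round(c_norm * sum(weights), 4))          # 0.9677 -- close to, not exactly, 1
```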

View file

@@ -0,0 +1,105 @@
import torch.nn as nn
from models.layers import ConvBlock, OutputBlock, UpsampleBlock
class UNet(nn.Module):
def __init__(
self,
in_channels,
n_class,
kernels,
strides,
normalization_layer,
negative_slope,
deep_supervision,
dimension,
):
super(UNet, self).__init__()
self.dim = dimension
self.n_class = n_class
self.negative_slope = negative_slope
self.deep_supervision = deep_supervision
self.norm = normalization_layer + f"norm{dimension}d"
self.filters = [min(2 ** (5 + i), 320) for i in range(len(strides))]
self.input_block = self.get_conv_block(
conv_block=ConvBlock,
in_channels=in_channels,
out_channels=self.filters[0],
kernel_size=kernels[0],
stride=strides[0],
)
self.downsamples = self.get_module_list(
conv_block=ConvBlock,
in_channels=self.filters[:-1],
out_channels=self.filters[1:],
kernels=kernels[1:-1],
strides=strides[1:-1],
)
self.bottleneck = self.get_conv_block(
conv_block=ConvBlock,
in_channels=self.filters[-2],
out_channels=self.filters[-1],
kernel_size=kernels[-1],
stride=strides[-1],
)
self.upsamples = self.get_module_list(
conv_block=UpsampleBlock,
in_channels=self.filters[1:][::-1],
out_channels=self.filters[:-1][::-1],
kernels=kernels[1:][::-1],
strides=strides[1:][::-1],
)
self.output_block = self.get_output_block(decoder_level=0)
self.deep_supervision_heads = self.get_deep_supervision_heads()
self.apply(self.initialize_weights)
def forward(self, input_data):
out = self.input_block(input_data)
encoder_outputs = [out]
for downsample in self.downsamples:
out = downsample(out)
encoder_outputs.append(out)
out = self.bottleneck(out)
decoder_outputs = []
for upsample, skip in zip(self.upsamples, reversed(encoder_outputs)):
out = upsample(out, skip)
decoder_outputs.append(out)
out = self.output_block(out)
if self.training and self.deep_supervision:
out = [out]
for i, decoder_out in enumerate(decoder_outputs[2:-1][::-1]):
out.append(self.deep_supervision_heads[i](decoder_out))
return out
def get_conv_block(self, conv_block, in_channels, out_channels, kernel_size, stride):
return conv_block(
in_channels=in_channels,
out_channels=out_channels,
kernel_size=kernel_size,
stride=stride,
norm=self.norm,
negative_slope=self.negative_slope,
dim=self.dim,
)
def get_output_block(self, decoder_level):
return OutputBlock(in_channels=self.filters[decoder_level], out_channels=self.n_class, dim=self.dim)
def get_deep_supervision_heads(self):
return nn.ModuleList([self.get_output_block(i + 1) for i in range(len(self.upsamples) - 1)])
def get_module_list(self, in_channels, out_channels, kernels, strides, conv_block):
layers = []
for in_channel, out_channel, kernel, stride in zip(in_channels, out_channels, kernels, strides):
conv_layer = self.get_conv_block(conv_block, in_channel, out_channel, kernel, stride)
layers.append(conv_layer)
return nn.ModuleList(layers)
def initialize_weights(self, module):
name = module.__class__.__name__.lower()
if name in ["conv2d", "conv3d"]:
nn.init.kaiming_normal_(module.weight, a=self.negative_slope)
elif name in ["convtranspose2d", "convtranspose3d"]:
nn.init.kaiming_normal_(module.weight, a=1.0)
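A forward-pass sketch on dummy data, assuming this repo's layout. With four resolution levels (one stride-1 input block plus three stride-2 stages) the encoder filters become [32, 64, 128, 256] and the output recovers the input resolution:

```python
import torch

from models.unet import UNet  # assuming this repo's layout

model = UNet(
    in_channels=1,
    n_class=3,
    kernels=[[3, 3, 3]] * 4,
    strides=[[1, 1, 1]] + [[2, 2, 2]] * 3,  # three stride-2 stages
    normalization_layer="instance",
    negative_slope=0.01,
    deep_supervision=False,
    dimension=3,
)
x = torch.randn(1, 1, 64, 64, 64)
with torch.no_grad():
    print(model(x).shape)  # torch.Size([1, 3, 64, 64, 64])
```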

View file

@@ -0,0 +1,37 @@
import os
import time
from argparse import ArgumentDefaultsHelpFormatter, ArgumentParser
from subprocess import call
from data_preprocessing.convert2tfrec import Converter
from data_preprocessing.preprocessor import Preprocessor
from utils.utils import get_task_code
parser = ArgumentParser(formatter_class=ArgumentDefaultsHelpFormatter)
parser.add_argument("--data", type=str, default="/data", help="Path to data directory")
parser.add_argument("--results", type=str, default="/data", help="Path for saving results directory")
parser.add_argument(
"--exec_mode",
type=str,
default="training",
choices=["training", "test"],
help="Mode for data preprocessing",
)
parser.add_argument("--task", type=str, help="Number of task to be run. MSD uses numbers 01-10")
parser.add_argument("--dim", type=int, default=3, choices=[2, 3], help="Data dimension to prepare")
parser.add_argument("--n_jobs", type=int, default=-1, help="Number of parallel jobs for data preprocessing")
parser.add_argument("--vpf", type=int, default=1, help="Volumes per tfrecord")
if __name__ == "__main__":
args = parser.parse_args()
start = time.time()
Preprocessor(args).run()
Converter(args).run()
task_code = get_task_code(args)
path = os.path.join(args.data, task_code)
if args.exec_mode == "test":
path = os.path.join(path, "test")
call(f'find {path} -name "*.npy" -print0 | xargs -0 rm', shell=True)
end = time.time()
print(f"Preprocessing time: {(end - start):.2f}")

View file

@@ -0,0 +1,9 @@
git+https://github.com/NVIDIA/dllogger
nibabel==3.1.1
joblib==0.16.0
scikit-learn==0.23.2
pynvml==8.0.4
tensorflow==2.3.1
pillow==6.2.0
fsspec==0.8.0
pytorch_ranger==0.1.1

View file

@@ -0,0 +1,39 @@
import os
from argparse import ArgumentDefaultsHelpFormatter, ArgumentParser
from os.path import dirname
from subprocess import call
parser = ArgumentParser(formatter_class=ArgumentDefaultsHelpFormatter)
parser.add_argument("--mode", type=str, required=True, choices=["train", "predict"], help="Benchmarking mode")
parser.add_argument("--gpus", type=int, default=1, help="Number of GPUs to use")
parser.add_argument("--dim", type=int, required=True, help="Dimension of UNet")
parser.add_argument("--batch_size", type=int, default=2, help="Batch size")
parser.add_argument("--amp", action="store_true", help="Enable automatic mixed precision")
parser.add_argument("--train_batches", type=int, default=80, help="Number of batches for training")
parser.add_argument("--test_batches", type=int, default=80, help="Number of batches for inference")
parser.add_argument("--warmup", type=int, default=10, help="Warmup iterations before collecting statistics")
parser.add_argument("--results", type=str, default="/results", help="Path to results directory")
parser.add_argument("--logname", type=str, default="perf.json", help="Name of dlloger output")
parser.add_argument("--create_idx", action="store_true", help="Create index files for tfrecord")
parser.add_argument("--profile", action="store_true", help="Enable dlprof profiling")
if __name__ == "__main__":
args = parser.parse_args()
path_to_main = os.path.join(dirname(dirname(os.path.realpath(__file__))), "main.py")
cmd = "python main.py --task 01 --benchmark --max_epochs 1 --min_epochs 1 "
cmd += f"--results {args.results} "
cmd += f"--logname {args.logname} "
cmd += f"--exec_mode {args.mode} "
cmd += f"--dim {args.dim} "
cmd += f"--gpus {args.gpus} "
cmd += f"--train_batches {args.train_batches} "
cmd += f"--test_batches {args.test_batches} "
cmd += f"--warmup {args.warmup} "
cmd += "--amp " if args.amp else ""
cmd += "--create_idx " if args.create_idx else ""
cmd += "--profile " if args.profile else ""
if args.mode == "train":
cmd += f"--batch_size {args.batch_size} "
else:
cmd += f"--val_batch_size {args.batch_size} "
call(cmd, shell=True)

View file

@@ -0,0 +1,27 @@
import os
from argparse import ArgumentDefaultsHelpFormatter, ArgumentParser
from os.path import dirname
from subprocess import call
parser = ArgumentParser(formatter_class=ArgumentDefaultsHelpFormatter)
parser.add_argument("--fold", type=int, required=True, choices=[0, 1, 2, 3, 4], help="Fold number")
parser.add_argument("--dim", type=int, required=True, help="Dimension of UNet")
parser.add_argument("--ckpt_path", type=str, required=True, help="Path to checkpoint")
parser.add_argument("--val_batch_size", type=int, default=4, help="Batch size")
parser.add_argument("--amp", action="store_true", help="Enable automatic mixed precision")
parser.add_argument("--tta", action="store_true", help="Enable test time augmentation")
parser.add_argument("--save_preds", action="store_true", help="Save predicted masks")
if __name__ == "__main__":
args = parser.parse_args()
path_to_main = os.path.join(dirname(dirname(os.path.realpath(__file__))), "main.py")
cmd = f"python {path_to_main} --exec_mode evaluate --task 01 --gpus 1 "
cmd += f"--dim {args.dim} "
cmd += f"--fold {args.fold} "
cmd += f"--ckpt_path {args.ckpt_path} "
cmd += f"--val_batch_size {args.val_batch_size} "
cmd += "--amp " if args.amp else ""
cmd += "--tta " if args.tta else ""
cmd += "--save_preds " if args.save_preds else ""
call(cmd, shell=True)

View file

@@ -0,0 +1,3 @@
python download.py --task 01
python preprocess.py --task 01 --dim 3
python preprocess.py --task 01 --dim 2

View file

@@ -0,0 +1,22 @@
import os
from argparse import ArgumentDefaultsHelpFormatter, ArgumentParser
from os.path import dirname
from subprocess import call
parser = ArgumentParser(formatter_class=ArgumentDefaultsHelpFormatter)
parser.add_argument("--gpus", type=int, required=True, help="Number of GPUs")
parser.add_argument("--fold", type=int, required=True, choices=[0, 1, 2, 3, 4], help="Fold number")
parser.add_argument("--dim", type=int, required=True, choices=[2, 3], help="Dimension of UNet")
parser.add_argument("--amp", action="store_true", help="Enable automatic mixed precision")
if __name__ == "__main__":
args = parser.parse_args()
path_to_main = os.path.join(dirname(dirname(os.path.realpath(__file__))), "main.py")
cmd = f"python {path_to_main} --exec_mode train --task 01 --deep_supervision --save_ckpt "
cmd += f"--dim {args.dim} "
cmd += f"--batch_size {2 if args.dim == 3 else 16} "
cmd += f"--val_batch_size {4 if args.dim == 3 else 64} "
cmd += f"--fold {args.fold} "
cmd += f"--gpus {args.gpus} "
cmd += "--amp " if args.amp else ""
call(cmd, shell=True)

View file

@@ -0,0 +1,145 @@
import collections
import math
import os
import pathlib
import re
import pynvml
pynvml.nvmlInit()
def systemGetDriverVersion():
return pynvml.nvmlSystemGetDriverVersion()
def deviceGetCount():
return pynvml.nvmlDeviceGetCount()
class device:
# assume nvml returns list of 64 bit ints
_nvml_affinity_elements = math.ceil(os.cpu_count() / 64)
def __init__(self, device_idx):
super().__init__()
self.handle = pynvml.nvmlDeviceGetHandleByIndex(device_idx)
def getName(self):
return pynvml.nvmlDeviceGetName(self.handle)
def getCpuAffinity(self):
affinity_string = ""
for j in pynvml.nvmlDeviceGetCpuAffinity(self.handle, device._nvml_affinity_elements):
# assume nvml returns list of 64 bit ints
affinity_string = "{:064b}".format(j) + affinity_string
affinity_list = [int(x) for x in affinity_string]
affinity_list.reverse() # so core 0 is in 0th element of list
ret = [i for i, e in enumerate(affinity_list) if e != 0]
return ret
def set_socket_affinity(gpu_id):
dev = device(gpu_id)
affinity = dev.getCpuAffinity()
os.sched_setaffinity(0, affinity)
def set_single_affinity(gpu_id):
dev = device(gpu_id)
affinity = dev.getCpuAffinity()
os.sched_setaffinity(0, affinity[:1])
def set_single_unique_affinity(gpu_id, world_size):
devices = [device(i) for i in range(world_size)]
socket_affinities = [dev.getCpuAffinity() for dev in devices]
siblings_list = get_thread_siblings_list()
siblings_dict = dict(siblings_list)
# remove siblings
for idx, socket_affinity in enumerate(socket_affinities):
socket_affinities[idx] = list(set(socket_affinity) - set(siblings_dict.values()))
affinities = []
assigned = []
for socket_affinity in socket_affinities:
for core in socket_affinity:
if core not in assigned:
affinities.append([core])
assigned.append(core)
break
os.sched_setaffinity(0, affinities[gpu_id])
def set_socket_unique_affinity(gpu_id, world_size, mode):
    devices = [device(i) for i in range(world_size)]
    socket_affinities = [dev.getCpuAffinity() for dev in devices]
siblings_list = get_thread_siblings_list()
siblings_dict = dict(siblings_list)
# remove siblings
for idx, socket_affinity in enumerate(socket_affinities):
socket_affinities[idx] = list(set(socket_affinity) - set(siblings_dict.values()))
socket_affinities_to_device_ids = collections.defaultdict(list)
for idx, socket_affinity in enumerate(socket_affinities):
socket_affinities_to_device_ids[tuple(socket_affinity)].append(idx)
for socket_affinity, device_ids in socket_affinities_to_device_ids.items():
devices_per_group = len(device_ids)
cores_per_device = len(socket_affinity) // devices_per_group
for group_id, device_id in enumerate(device_ids):
if device_id == gpu_id:
if mode == "interleaved":
affinity = list(socket_affinity[group_id::devices_per_group])
elif mode == "continuous":
affinity = list(socket_affinity[group_id * cores_per_device : (group_id + 1) * cores_per_device])
else:
raise RuntimeError("Unknown set_socket_unique_affinity mode")
# reintroduce siblings
affinity += [siblings_dict[aff] for aff in affinity if aff in siblings_dict]
os.sched_setaffinity(0, affinity)
def get_thread_siblings_list():
path = "/sys/devices/system/cpu/cpu*/topology/thread_siblings_list"
thread_siblings_list = []
    pattern = re.compile(r"(\d+)\D(\d+)")
    # path[0] is "/"; the remainder is a glob pattern applied from the filesystem root.
    for fname in pathlib.Path(path[0]).glob(path[1:]):
with open(fname) as f:
content = f.read().strip()
res = pattern.findall(content)
if res:
pair = tuple(map(int, res[0]))
thread_siblings_list.append(pair)
return thread_siblings_list
def set_affinity(gpu_id=None, mode="socket"):
if gpu_id is None:
gpu_id = int(os.getenv("LOCAL_RANK", 0))
world_size = int(os.getenv("WORLD_SIZE", 1))
if mode == "socket":
set_socket_affinity(gpu_id)
elif mode == "single":
set_single_affinity(gpu_id)
elif mode == "single_unique":
set_single_unique_affinity(gpu_id, world_size)
elif mode == "socket_unique_interleaved":
set_socket_unique_affinity(gpu_id, world_size, "interleaved")
elif mode == "socket_unique_continuous":
set_socket_unique_affinity(gpu_id, world_size, "continuous")
else:
raise RuntimeError("Unknown affinity mode")
affinity = os.sched_getaffinity(0)
return affinity
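Typical usage pins each rank's process to the CPU cores NVML reports as local to its GPU; main.py does this once at startup. A minimal sketch (it needs an NVIDIA driver present, since the core map comes from NVML):

```python
import os

from utils.gpu_affinity import set_affinity  # assuming this repo's layout

local_rank = int(os.getenv("LOCAL_RANK", "0"))
affinity = set_affinity(local_rank, mode="socket_unique_interleaved")
print(f"rank {local_rank} pinned to cores: {sorted(affinity)}")
```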

View file

@@ -0,0 +1,66 @@
import operator
import time
import dllogger as logger
import numpy as np
import torch.cuda.profiler as profiler
from dllogger import JSONStreamBackend, StdOutBackend, Verbosity
from pytorch_lightning import Callback
from utils.utils import is_main_process
class LoggingCallback(Callback):
def __init__(self, log_dir, global_batch_size, mode, warmup, dim, profile):
logger.init(backends=[JSONStreamBackend(Verbosity.VERBOSE, log_dir), StdOutBackend(Verbosity.VERBOSE)])
self.warmup_steps = warmup
self.global_batch_size = global_batch_size
self.step = 0
self.dim = dim
self.mode = mode
self.profile = profile
self.timestamps = []
def do_step(self):
self.step += 1
if self.profile and self.step == self.warmup_steps:
profiler.start()
if self.step >= self.warmup_steps:
self.timestamps.append(time.time())
def on_train_batch_start(self, trainer, pl_module, batch, batch_idx, dataloader_idx):
self.do_step()
def on_test_batch_start(self, trainer, pl_module, batch, batch_idx, dataloader_idx):
self.do_step()
def process_performance_stats(self, deltas):
def _round3(val):
return round(val, 3)
throughput_imgps = _round3(self.global_batch_size / np.mean(deltas))
timestamps_ms = 1000 * deltas
stats = {
f"throughput_{self.mode}": throughput_imgps,
f"latency_{self.mode}_mean": _round3(timestamps_ms.mean()),
}
for level in [90, 95, 99]:
stats.update({f"latency_{self.mode}_{level}": _round3(np.percentile(timestamps_ms, level))})
return stats
def log(self):
if is_main_process():
diffs = list(map(operator.sub, self.timestamps[1:], self.timestamps[:-1]))
deltas = np.array(diffs)
stats = self.process_performance_stats(deltas)
logger.log(step=(), data=stats)
logger.flush()
def on_train_end(self, trainer, pl_module):
if self.profile:
profiler.stop()
self.log()
def on_test_end(self, trainer, pl_module):
self.log()
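The callback reduces inter-step timestamp deltas to throughput and latency percentiles. The same arithmetic on synthetic numbers, for reference:

```python
import numpy as np

deltas = np.array([0.052, 0.049, 0.050, 0.055, 0.048])  # seconds between steps
global_batch_size = 16

throughput = round(global_batch_size / np.mean(deltas), 3)  # images per second
latency_ms = 1000 * deltas
print("throughput:", throughput)
print("mean latency (ms):", round(latency_ms.mean(), 3))
for level in (90, 95, 99):
    print(f"latency p{level} (ms):", round(np.percentile(latency_ms, level), 3))
```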

View file

@@ -0,0 +1,177 @@
import os
import pickle
from argparse import ArgumentDefaultsHelpFormatter, ArgumentParser
from subprocess import call
import torch
def is_main_process():
return int(os.getenv("LOCAL_RANK", "0")) == 0
def set_cuda_devices(args):
assert args.gpus <= torch.cuda.device_count(), f"Requested {args.gpus} gpus, available {torch.cuda.device_count()}."
device_list = ",".join([str(i) for i in range(args.gpus)])
os.environ["CUDA_VISIBLE_DEVICES"] = os.environ.get("CUDA_VISIBLE_DEVICES", device_list)
def verify_ckpt_path(args):
resume_path = os.path.join(args.results, "checkpoints", "last.ckpt")
ckpt_path = resume_path if args.resume_training and os.path.exists(resume_path) else args.ckpt_path
return ckpt_path
def get_task_code(args):
return f"{args.task}_{args.dim}d"
def get_config_file(args):
task_code = get_task_code(args)
config_file = os.path.join(args.data, task_code, "config.pkl")
return pickle.load(open(config_file, "rb"))
def make_empty_dir(path):
call(["rm", "-rf", path])
os.makedirs(path)
def flip(data, axis):
return torch.flip(data, dims=axis)
def positive_int(value):
ivalue = int(value)
assert ivalue > 0, f"Argparse error. Expected positive integer but got {value}"
return ivalue
def non_negative_int(value):
ivalue = int(value)
    assert ivalue >= 0, f"Argparse error. Expected non-negative integer but got {value}"
return ivalue
def float_0_1(value):
ivalue = float(value)
    assert 0 <= ivalue <= 1, f"Argparse error. Expected float in range [0, 1], but got {value}"
return ivalue
def get_main_args():
parser = ArgumentParser(formatter_class=ArgumentDefaultsHelpFormatter)
parser.add_argument(
"--exec_mode",
type=str,
choices=["train", "evaluate", "predict"],
default="train",
help="Execution mode to run the model",
)
parser.add_argument("--data", type=str, default="/data", help="Path to data directory")
parser.add_argument("--results", type=str, default="/results", help="Path to results directory")
parser.add_argument("--logname", type=str, default=None, help="Name of dlloger output")
parser.add_argument("--task", type=str, help="Task number. MSD uses numbers 01-10")
parser.add_argument("--gpus", type=non_negative_int, default=1, help="Number of gpus")
parser.add_argument("--learning_rate", type=float, default=0.001, help="Learning rate")
parser.add_argument("--gradient_clip_val", type=float, default=0, help="Gradient clipping norm value")
parser.add_argument("--negative_slope", type=float, default=0.01, help="Negative slope for LeakyReLU")
parser.add_argument("--tta", action="store_true", help="Enable test time augmentation")
parser.add_argument("--amp", action="store_true", help="Enable automatic mixed precision")
parser.add_argument("--benchmark", action="store_true", help="Run model benchmarking")
parser.add_argument("--deep_supervision", action="store_true", help="Enable deep supervision")
parser.add_argument("--sync_batchnorm", action="store_true", help="Enable synchronized batchnorm")
parser.add_argument("--save_ckpt", action="store_true", help="Enable saving checkpoint")
parser.add_argument("--nfolds", type=positive_int, default=5, help="Number of cross-validation folds")
parser.add_argument("--seed", type=non_negative_int, default=1, help="Random seed")
parser.add_argument("--ckpt_path", type=str, default=None, help="Path to checkpoint")
parser.add_argument("--fold", type=non_negative_int, default=0, help="Fold number")
parser.add_argument("--patience", type=positive_int, default=100, help="Early stopping patience")
parser.add_argument("--lr_patience", type=positive_int, default=70, help="Patience for ReduceLROnPlateau scheduler")
parser.add_argument("--batch_size", type=positive_int, default=2, help="Batch size")
parser.add_argument("--val_batch_size", type=positive_int, default=4, help="Validation batch size")
parser.add_argument("--steps", nargs="+", type=positive_int, required=False, help="Steps for multistep scheduler")
parser.add_argument("--create_idx", action="store_true", help="Create index files for tfrecord")
parser.add_argument("--profile", action="store_true", help="Run dlprof profiling")
parser.add_argument("--momentum", type=float, default=0.99, help="Momentum factor")
parser.add_argument("--weight_decay", type=float, default=0.0001, help="Weight decay (L2 penalty)")
parser.add_argument("--save_preds", action="store_true", help="Enable prediction saving")
parser.add_argument("--dim", type=int, choices=[2, 3], default=3, help="UNet dimension")
parser.add_argument("--resume_training", action="store_true", help="Resume training from the last checkpoint")
parser.add_argument("--factor", type=float, default=0.3, help="Scheduler factor")
parser.add_argument(
"--num_workers", type=non_negative_int, default=8, help="Number of subprocesses to use for data loading"
)
parser.add_argument(
"--min_epochs", type=non_negative_int, default=30, help="Force training for at least these many epochs"
)
parser.add_argument(
"--max_epochs", type=non_negative_int, default=10000, help="Stop training after this number of epochs"
)
parser.add_argument(
"--warmup", type=non_negative_int, default=5, help="Warmup iterations before collecting statistics"
)
parser.add_argument(
"--oversampling",
type=float_0_1,
default=0.33,
help="Probability of crop to have some region with positive label",
)
parser.add_argument(
"--norm", type=str, choices=["instance", "batch", "group"], default="instance", help="Normalization layer"
)
parser.add_argument(
"--overlap",
type=float_0_1,
default=0.25,
help="Amount of overlap between scans during sliding window inference",
)
parser.add_argument(
"--affinity",
type=str,
default="socket_unique_interleaved",
choices=[
"socket",
"single",
"single_unique",
"socket_unique_interleaved",
"socket_unique_continuous",
"disabled",
],
help="type of CPU affinity",
)
parser.add_argument(
"--scheduler",
type=str,
default="none",
choices=["none", "multistep", "cosine", "plateau"],
help="Learning rate scheduler",
)
parser.add_argument(
"--optimizer",
type=str,
default="radam",
choices=["sgd", "adam", "adamw", "radam", "fused_adam"],
help="Optimizer",
)
parser.add_argument(
"--val_mode",
type=str,
choices=["gaussian", "constant"],
default="gaussian",
help="How to blend output of overlapping windows",
)
parser.add_argument(
"--train_batches",
type=non_negative_int,
default=0,
help="Limit number of batches for training (used for benchmarking mode only)",
)
parser.add_argument(
"--test_batches",
type=non_negative_int,
default=0,
help="Limit number of batches for inference (used for benchmarking mode only)",
)
args = parser.parse_args()
return args
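The custom argparse types reject out-of-range values at parse time instead of deep inside training. A short demonstration, assuming this module is importable as utils.utils:

```python
from argparse import ArgumentParser

from utils.utils import float_0_1, positive_int  # assuming this repo's layout

parser = ArgumentParser()
parser.add_argument("--batch_size", type=positive_int)
parser.add_argument("--overlap", type=float_0_1)

args = parser.parse_args(["--batch_size", "2", "--overlap", "0.25"])
print(args.batch_size, args.overlap)  # 2 0.25

# parser.parse_args(["--batch_size", "0"]) would trip the positive_int assertion.
```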