Merge r1.5.0 bugfixes and doc updates to main (#3093)

* update branch Signed-off-by: ericharper <complex451@gmail.com> * Fix quantization bug in Asr (#3062) Signed-off-by: smajumdar <titu1994@gmail.com> * Update reinstall and cherry-pick bignlp commits (#3065) * add ptl install to reinstall and update jenkins install Signed-off-by: ericharper <complex451@gmail.com> * Add a stateless timer to specify max_time per run instead of global m… (#3056) * Add a stateless timer to specify max_time per run instead of global max_time across runs Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca> * Style fixes Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca> * (1) reduce the validation loss within a epoch, (2) convert global-batch-based iteartion counts to micro-batch-based (#3055) Co-authored-by: Oleksii Kuchaiev <okuchaiev@users.noreply.github.com> * Timer class monitors total time (train + validation + testing) to monitor when to end training (#3061) * Check total time in train/validation to exit Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca> * Style fixes Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca> Co-authored-by: Oleksii Kuchaiev <okuchaiev@users.noreply.github.com> * Add PUBLICATIONS.md (#3051) * Add PUBLICATIONS.md Signed-off-by: smajumdar <titu1994@gmail.com> * Add NLP Signed-off-by: smajumdar <titu1994@gmail.com> * Update PUBLICATIONS.md * Update PUBLICATIONS.md * Fix links Signed-off-by: smajumdar <titu1994@gmail.com> Co-authored-by: Eric Harper <complex451@gmail.com> * fix uninstall Signed-off-by: ericharper <complex451@gmail.com> Co-authored-by: Sandeep Subramanian <sandeep.subramanian.1@umontreal.ca> Co-authored-by: Sangkug Lym <slym@nvidia.com> Co-authored-by: Oleksii Kuchaiev <okuchaiev@users.noreply.github.com> Co-authored-by: Somshubra Majumdar <titu1994@gmail.com> * fix File Load Error (#3069) Signed-off-by: fayejf <fayejf07@gmail.com> Co-authored-by: Eric Harper <complex451@gmail.com> * Update hyper parameter saving (#3058) Signed-off-by: smajumdar <titu1994@gmail.com> Co-authored-by: Eric Harper <complex451@gmail.com> * Exp manager small refactor (#3067) * Exp manager small refactor Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca> * move super() call earlier in the function Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca> Co-authored-by: Somshubra Majumdar <titu1994@gmail.com> * Fix FastPitch Pitch Duration Notebook (#3068) * bugfix Signed-off-by: Jason <jasoli@nvidia.com> * bugfix2 Signed-off-by: Jason <jasoli@nvidia.com> * better check Signed-off-by: Jason <jasoli@nvidia.com> * confusionmatrix (#3085) Signed-off-by: fayejf <fayejf07@gmail.com> * typo and fix link (#3086) Signed-off-by: ekmb <ebakhturina@nvidia.com> * inf cross-checking across tensor-parallel ranks (#3088) * inf cross-checking across tensor-parallel ranks * sylte Signed-off-by: ericharper <complex451@gmail.com> Co-authored-by: Eric Harper <complex451@gmail.com> * Fix save top k (#3075) * inject mp_rank for checkpoint paths in NLPDDPPlugin Signed-off-by: ericharper <complex451@gmail.com> * == instead of i Signed-off-by: ericharper <complex451@gmail.com> * when checking previous run account for mp Signed-off-by: ericharper <complex451@gmail.com> * uninject mp ranks when needed Signed-off-by: ericharper <complex451@gmail.com> * style Signed-off-by: ericharper <complex451@gmail.com> * update branch Signed-off-by: ericharper <complex451@gmail.com> Co-authored-by: Somshubra Majumdar <titu1994@gmail.com> Co-authored-by: Sandeep Subramanian <sandeep.subramanian.1@umontreal.ca> Co-authored-by: Sangkug Lym <slym@nvidia.com> Co-authored-by: Oleksii Kuchaiev <okuchaiev@users.noreply.github.com> Co-authored-by: fayejf <36722593+fayejf@users.noreply.github.com> Co-authored-by: Jason <jasoli@nvidia.com> Co-authored-by: Evelina <10428420+ekmb@users.noreply.github.com>
2021-10-29 20:15:37 -06:00 · 2021-10-29 20:15:37 -06:00 · 574b1014fd
parent 648c97f076
commit 574b1014fd
13 changed files with 205 additions and 191 deletions
--- a/13
+++ b/13
@ -53,12 +53,6 @@ pipeline {
      }
    }

-    // Revert once import guards are added by PTL or version comparing is fixed
-    stage('Install PyTorch Lighting 1.5 RC') {
-      steps{
-        sh 'pip install pytorch-lightning==1.5.0rc0 && sed -i "s/from pytorch_lightning.callbacks.quantization import QuantizationAwareTraining/try:\\n\\tfrom pytorch_lightning.callbacks.quantization import QuantizationAwareTraining\\nexcept:\\n\\tpass/g" /opt/conda/lib/python3.8/site-packages/pytorch_lightning/callbacks/__init__.py'
-      }
-    }

    stage('NeMo Installation') {
      steps {
@ -66,6 +60,13 @@ pipeline {
      }
    }

+    // Revert once import guards are added by PTL or version comparing is fixed
+    stage('PTL Import Guards') {
+      steps{
+        sh 'sed -i "s/from pytorch_lightning.callbacks.quantization import QuantizationAwareTraining/try:\\n\\tfrom pytorch_lightning.callbacks.quantization import QuantizationAwareTraining\\nexcept:\\n\\tpass/g" /opt/conda/lib/python3.8/site-packages/pytorch_lightning/callbacks/__init__.py'
+      }
+    }
+
    stage('PyTorch Lightning version') {
      steps {
        sh 'python -c "import pytorch_lightning; print(pytorch_lightning.__version__)"'
--- a/examples/nlp/language_modeling/conf/megatron_gpt_config.yaml
+++ b/examples/nlp/language_modeling/conf/megatron_gpt_config.yaml
@ -35,6 +35,7 @@ exp_manager:
    mode: min
    always_save_nemo: False # saves nemo file during validation, not implemented for model parallel
    filename: 'megatron_gpt--{val_loss:.2f}-{step}-{consumed_samples}'
+    model_parallel_size: ${model.tensor_model_parallel_size}


 model:
--- a/examples/nlp/language_modeling/megatron_gpt_pretraining.py
+++ b/examples/nlp/language_modeling/megatron_gpt_pretraining.py
@ -17,11 +17,11 @@ from pathlib import Path
 from omegaconf.omegaconf import OmegaConf
 from pytorch_lightning import Trainer
 from pytorch_lightning.callbacks.timer import Timer
+from pytorch_lightning.trainer.connectors.checkpoint_connector import CheckpointConnector

 from nemo.collections.nlp.models.language_modeling.megatron_gpt_model import MegatronGPTModel
 from nemo.collections.nlp.modules.common.megatron.megatron_utils import compute_model_parallel_rank
 from nemo.collections.nlp.parts.nlp_overrides import (
-    NLPCheckpointConnector,
    NLPDDPPlugin,
    NLPNativeBfloat16PrecisionPlugin,
    NLPNativeMixedPrecisionPlugin,
@ -69,7 +69,7 @@ def main(cfg) -> None:
        resume_from_checkpoint = str(resume_from_checkpoint)
        logging.info(f'Resuming training from checkpoint: {resume_from_checkpoint}')

-    trainer.checkpoint_connector = NLPCheckpointConnector(trainer, resume_from_checkpoint=resume_from_checkpoint)
+    trainer.checkpoint_connector = CheckpointConnector(trainer, resume_from_checkpoint=resume_from_checkpoint)
    # Override timer callback to a stateless one
    for idx, callback in enumerate(trainer.callbacks):
        if isinstance(callback, Timer):
--- a/nemo/collections/asr/parts/submodules/jasper.py
+++ b/nemo/collections/asr/parts/submodules/jasper.py
@ -433,7 +433,7 @@ class SqueezeExcite(nn.Module):

        if self.context_window < 0:
            if PYTORCH_QUANTIZATION_AVAILABLE and self._quantize:
-                if not isinstance(self.pool, quant_nn.QuantAdaptiveAvgPool1d(1)):
+                if not isinstance(self.pool, quant_nn.QuantAdaptiveAvgPool1d):
                    self.pool = quant_nn.QuantAdaptiveAvgPool1d(1)  # context window = T

            elif not PYTORCH_QUANTIZATION_AVAILABLE and self._quantize:
--- a/nemo/collections/nlp/parts/nlp_overrides.py
+++ b/nemo/collections/nlp/parts/nlp_overrides.py
@ -15,6 +15,7 @@
 import os
 import shutil
 import tempfile
+from collections import defaultdict
 from typing import Any, Dict, List, Optional, Union

 import pytorch_lightning as pl
@ -30,12 +31,13 @@ from pytorch_lightning.trainer.trainer import Trainer
 from pytorch_lightning.utilities import rank_zero_warn
 from pytorch_lightning.utilities.cloud_io import atomic_save
 from pytorch_lightning.utilities.enums import GradClipAlgorithmType
+from pytorch_lightning.utilities.types import _PATH
 from torch.nn.modules.module import Module
 from torch.nn.parallel import DistributedDataParallel
 from torch.optim.optimizer import Optimizer

 from nemo.core.connectors.save_restore_connector import SaveRestoreConnector
-from nemo.utils import AppState, logging
+from nemo.utils import AppState, app_state, logging

 try:
    from apex.transformer import parallel_state
@ -127,26 +129,38 @@ class NLPDDPPlugin(DDPPlugin):
                logging.info(f'mp_rank: {app_state.model_parallel_rank}')
                logging.info(f'dp_rank: {app_state.data_parallel_rank}')

-    def save_checkpoint(self, checkpoint: Dict[str, Any], filepath: str) -> None:
-        """Save model/training states as a checkpoint file through state-dump and file-write.
+    def save_checkpoint(self, checkpoint: Dict[str, Any], filepath: _PATH) -> None:
+        # PTL override to accomodate model parallel checkpoints
+        filepath = self._inject_model_parallel_rank(filepath)
+        return super().save_checkpoint(checkpoint, filepath)

-        Args:
-            checkpoint: dict containing model and trainer state
-            filepath: write-target file's path
-        """
+    def remove_checkpoint(self, filepath: _PATH) -> None:
+        # PTL override to accomodate model parallel checkpoints
+        filepath = self._inject_model_parallel_rank(filepath)
+        logging.info(f'Removing checkpoint: {filepath}')
+        return super().remove_checkpoint(filepath)
+
+    def _inject_model_parallel_rank(self, filepath):
        app_state = AppState()
-        # dump states as a checkpoint dictionary object
-        # TrainingTypePlugin.on_save() just seems to return the same thing.
-        # checkpoint = self.on_save(checkpoint)
-        if self.is_global_zero or app_state.data_parallel_rank == 0:
-            try:
-                # write the checkpoint dictionary on the file
-                atomic_save(checkpoint, filepath)
-            except AttributeError as err:
-                key = pl.LightningModule.CHECKPOINT_HYPER_PARAMS_KEY
-                checkpoint.pop(key, None)
-                rank_zero_warn(f"Warning, `{key}` dropped from checkpoint. An attribute is not picklable: {err}")
-                atomic_save(checkpoint, filepath)
+        # inserts mp_rank_XX for model parallel checkpoints
+        if app_state.model_parallel_size is not None and app_state.model_parallel_size > 1:
+            # filepath needs to be updated to include mp_rank
+            dirname = os.path.dirname(filepath)
+            basename = os.path.basename(filepath)
+            filepath = f'{dirname}/mp_rank_{app_state.model_parallel_rank:02d}/{basename}'
+            return filepath
+        else:
+            return filepath
+
+    @property
+    def should_rank_save_checkpoint(self) -> bool:
+        # PTL override that determines if checkpoints should be saved based on rank
+        # for model parallel we need data_parallel_rank==0
+        app_state = AppState()
+        if app_state.model_parallel_size is not None and app_state.model_parallel_size > 1:
+            return app_state.data_parallel_rank == 0
+        else:
+            return super().should_rank_save_checkpoint

    @property
    def distributed_sampler_kwargs(self):
@ -164,38 +178,6 @@ class NLPDDPPlugin(DDPPlugin):
            return super(NLPDDPPlugin, self).distributed_sampler_kwargs


-class NLPCheckpointConnector(CheckpointConnector):
-    """ Override PTL CheckpointConnector to support model parallel checkpoints from Megatron-LM.
-    """
-
-    def __init__(self, trainer, resume_from_checkpoint):
-        super().__init__(trainer, resume_from_checkpoint)
-        if not HAVE_APEX:
-            logging.warning("Apex was not found. Using model parallel or megatron models will error out.")
-
-    def save_checkpoint(self, filepath, weights_only: bool = False) -> None:
-        """Slightly modified version of PyTorch Lightning's save_checkpoint.
-           Accounts for model parallel training.
-           Save model/training states as a checkpoint file through state-dump and file-write.
-
-        Args:
-            filepath: write-target file's path
-            weights_only: saving model weights only
-        """
-        app_state = AppState()
-        if app_state.model_parallel_size is not None:
-            # filepath needs to be updated to include mp_rank
-            dirname = os.path.dirname(filepath)
-            basename = os.path.basename(filepath)
-            filepath = f'{dirname}/mp_rank_{app_state.model_parallel_rank:02d}/{basename}'
-            _checkpoint = self.dump_checkpoint(weights_only)
-            # each model parallel rank needs to save a copy of its model
-            if app_state.data_parallel_rank == 0:
-                self.trainer.accelerator.save_checkpoint(_checkpoint, filepath)
-        else:
-            super().save_checkpoint(filepath, weights_only)
-
-
 class NLPSaveRestoreConnector(SaveRestoreConnector):
    def __init__(self) -> None:
        super().__init__()
@ -243,11 +225,117 @@ class NLPSaveRestoreConnector(SaveRestoreConnector):
            return super().save_to(model, save_path)


+class GradScaler(torch.cuda.amp.GradScaler):
+    """
+    Gradient sclaer for model-parallel inf check. The inf in gradients are checked across tensor-parallel
+    ranks in (1) executing optimizer step and (2) gradient scaler update.
+
+    """
+
+    def __init__(
+        self, init_scale=2.0 ** 16, growth_factor=2.0, backoff_factor=0.5, growth_interval=2000, enabled=True
+    ):
+        super().__init__(
+            init_scale=init_scale,
+            growth_factor=growth_factor,
+            backoff_factor=backoff_factor,
+            growth_interval=growth_interval,
+            enabled=enabled,
+        )
+
+    def _maybe_opt_step(self, optimizer, optimizer_state, *args, **kwargs):
+        retval = None
+        found_inf = torch.cuda.FloatTensor([sum(v.item() for v in optimizer_state["found_inf_per_device"].values())])
+
+        # Update across all model parallel instances.
+        torch.distributed.all_reduce(
+            found_inf, op=torch.distributed.ReduceOp.MAX, group=parallel_state.get_model_parallel_group()
+        )
+
+        if found_inf.item() == 0:
+            retval = optimizer.step(*args, **kwargs)
+        return retval
+
+    def update(self, new_scale=None):
+        """
+        Updates the scale factor.
+
+        If any optimizer steps were skipped the scale is multiplied by ``backoff_factor``
+        to reduce it. If ``growth_interval`` unskipped iterations occurred consecutively,
+        the scale is multiplied by ``growth_factor`` to increase it.
+
+        Passing ``new_scale`` sets the new scale value manually. (``new_scale`` is not
+        used directly, it's used to fill GradScaler's internal scale tensor. So if
+        ``new_scale`` was a tensor, later in-place changes to that tensor will not further
+        affect the scale GradScaler uses internally.)
+
+        Args:
+            new_scale (float or :class:`torch.cuda.FloatTensor`, optional, default=None):  New scale factor.
+
+        .. warning::
+            :meth:`update` should only be called at the end of the iteration, after ``scaler.step(optimizer)`` has
+            been invoked for all optimizers used this iteration.
+        """
+        if not self._enabled:
+            return
+
+        _scale, _growth_tracker = self._check_scale_growth_tracker("update")
+
+        if new_scale is not None:
+            # Accept a new user-defined scale.
+            if isinstance(new_scale, float):
+                self._scale.fill_(new_scale)  # type: ignore[union-attr]
+            else:
+                reason = "new_scale should be a float or a 1-element torch.cuda.FloatTensor with requires_grad=False."
+                assert isinstance(new_scale, torch.cuda.FloatTensor), reason  # type: ignore[attr-defined]
+                assert new_scale.numel() == 1, reason
+                assert new_scale.requires_grad is False, reason
+                self._scale.copy_(new_scale)  # type: ignore[union-attr]
+        else:
+            # Consume shared inf/nan data collected from optimizers to update the scale.
+            # If all found_inf tensors are on the same device as self._scale, this operation is asynchronous.
+            found_infs = [
+                found_inf.to(device=_scale.device, non_blocking=True)
+                for state in self._per_optimizer_states.values()
+                for found_inf in state["found_inf_per_device"].values()
+            ]
+
+            assert len(found_infs) > 0, "No inf checks were recorded prior to update."
+
+            found_inf_combined = found_infs[0]
+
+            # Update across all model parallel instances.
+            torch.distributed.all_reduce(
+                found_inf_combined, op=torch.distributed.ReduceOp.MAX, group=parallel_state.get_model_parallel_group()
+            )
+
+            if len(found_infs) > 1:
+                for i in range(1, len(found_infs)):
+                    found_inf = found_infs[i]
+                    # Update across all model parallel instances.
+                    torch.distributed.all_reduce(
+                        found_inf, op=torch.distributed.ReduceOp.MAX, group=parallel_state.get_model_parallel_group()
+                    )
+                    found_inf_combined += found_inf
+
+            torch._amp_update_scale_(
+                _scale,
+                _growth_tracker,
+                found_inf_combined,
+                self._growth_factor,
+                self._backoff_factor,
+                self._growth_interval,
+            )
+
+        # To prepare for next iteration, clear the data collected from optimizers this iteration.
+        self._per_optimizer_states = defaultdict(torch.cuda.amp.grad_scaler._refresh_per_optimizer_state)
+
+
 class NLPNativeMixedPrecisionPlugin(NativeMixedPrecisionPlugin):
    def __init__(self, init_scale: float = 2 ** 32, growth_interval: int = 1000) -> None:
        super().__init__(precision=16)

-        self.scaler = torch.cuda.amp.GradScaler(init_scale=init_scale, growth_interval=growth_interval)
+        self.scaler = GradScaler(init_scale=init_scale, growth_interval=growth_interval)

    def clip_gradients(
        self,
--- a/nemo/collections/tts/modules/fastpitch.py
+++ b/nemo/collections/tts/modules/fastpitch.py
@ -243,7 +243,8 @@ class FastPitchModule(NeuralModule):
        # Predict pitch
        pitch_predicted = self.pitch_predictor(enc_out, enc_mask)
        if pitch is not None:
-            if self.learn_alignment:
+            if self.learn_alignment and pitch.shape[-1] != pitch_predicted.shape[-1]:
+                # Pitch during training is per spectrogram frame, but during inference, it should be per character
                pitch = average_pitch(pitch.unsqueeze(1), attn_hard_dur).squeeze(1)
            pitch_emb = self.pitch_emb(pitch.unsqueeze(1))
        else:
--- a/nemo/core/classes/modelPT.py
+++ b/nemo/core/classes/modelPT.py
@ -99,7 +99,7 @@ class ModelPT(LightningModule, Model):

        self._cfg = cfg

-        self.save_hyperparameters(self._cfg)
+        self.save_hyperparameters("cfg")
        self._train_dl = None
        self._validation_dl = None
        self._test_dl = None
--- a/nemo/utils/exp_manager.py
+++ b/nemo/utils/exp_manager.py
@ -80,6 +80,7 @@ class CallbackParams:
    postfix: str = ".nemo"
    save_best_model: bool = False
    always_save_nemo: bool = False
+    model_parallel_size: Optional[int] = None


@dataclass
@ -115,6 +116,7 @@ class ExpManagerConfig:
    # logs timing of train/val/test steps
    log_step_timing: Optional[bool] = True
    step_timing_kwargs: Optional[StepTimingParams] = StepTimingParams()
+    model_parallel_size: Optional[int] = None


 class TimingCallback(Callback):
@ -652,12 +654,21 @@ class NeMoModelCheckpoint(ModelCheckpoint):
    """ Light wrapper around Lightning's ModelCheckpoint to force a saved checkpoint on train_end
    """

-    def __init__(self, always_save_nemo=False, save_best_model=False, postfix=".nemo", n_resume=False, **kwargs):
+    def __init__(
+        self,
+        always_save_nemo=False,
+        save_best_model=False,
+        postfix=".nemo",
+        n_resume=False,
+        model_parallel_size=None,
+        **kwargs,
+    ):
        # Parse and store "extended" parameters: save_best model and postfix.
        self.always_save_nemo = always_save_nemo
        self.save_best_model = save_best_model
        self.postfix = postfix
        self.previous_best_path = ""
+        self.model_parallel_size = model_parallel_size

        # `prefix` is deprecated
        if 'prefix' in kwargs:
@ -678,7 +689,6 @@ class NeMoModelCheckpoint(ModelCheckpoint):
            self.kth_best_model_path
            self.best_model_score
            self.best_model_path
-            self._del_model
        except AttributeError:
            raise AttributeError("Lightning's ModelCheckpoint was updated. NeMoModelCheckpoint will need an update.")
        self.best_k_models = {}
@ -688,6 +698,7 @@ class NeMoModelCheckpoint(ModelCheckpoint):

        checkpoints = list(Path(self.dirpath).rglob("*.ckpt"))
        for checkpoint in checkpoints:
+            checkpoint = self._uninject_mp_rank(checkpoint)
            checkpoint = str(checkpoint)
            if checkpoint[-10:] == '-last.ckpt':
                continue
@ -706,7 +717,10 @@ class NeMoModelCheckpoint(ModelCheckpoint):

        ### This section should be ok as rank zero will delete all excess checkpoints, since all other ranks are
        ### instantiated after rank zero. models_to_delete should be 0 for all other ranks.
-        models_to_delete = len(best_k_models) - self.save_top_k
+        if self.model_parallel_size is not None:
+            models_to_delete = len(best_k_models) - self.model_parallel_size * self.save_top_k
+        else:
+            models_to_delete = len(best_k_models) - self.save_top_k
        logging.debug(f'Number of models to delete: {models_to_delete}')
        for _ in range(models_to_delete):
            model = best_k_models.pop(-1)
@ -718,6 +732,17 @@ class NeMoModelCheckpoint(ModelCheckpoint):
        self.best_model_path = best_k_models[0]
        self.best_model_score = self.best_k_models[self.best_model_path]

+        # # uninject mp_rank from paths
+        # self.kth_best_model_path = self._uninject_mp_rank(self.kth_best_model_path)
+        # self.best_model_path = self._uninject_mp_rank(self.best_model_path)
+
+    @staticmethod
+    def _uninject_mp_rank(filepath):
+        dirname = os.path.dirname(os.path.dirname(filepath))
+        basename = os.path.basename(filepath)
+        filepath = os.path.join(dirname, basename)
+        return filepath
+
    def on_save_checkpoint(self, trainer, pl_module, checkpoint):
        output = super().on_save_checkpoint(trainer, pl_module, checkpoint)
        if not self.always_save_nemo:
@ -756,41 +781,21 @@ class NeMoModelCheckpoint(ModelCheckpoint):
        if trainer.fast_dev_run:
            return None

-        monitor_candidates = self._monitor_candidates(trainer, trainer.current_epoch, trainer.global_step - 1)
-        self._save_last_checkpoint(trainer, monitor_candidates=monitor_candidates)
+        # Call parent on_train_end() to save the -last checkpoint
+        super().on_train_end(trainer, pl_module)

        # Load the best model and then re-save it
        if self.save_best_model:
-            if self.best_model_path is "":
+            if self.best_model_path == "":
                logging.warning(
                    f"{self} was told to save the best checkpoint at the end of training, but no saved checkpoints "
                    "were found. Saving latest model instead."
                )
            else:
                trainer.checkpoint_connector.restore(self.best_model_path)
+
        pl_module.save_to(save_path=os.path.join(self.dirpath, self.prefix + self.postfix))

-    def _del_model(self, trainer: "pl.Trainer", filepath: str) -> None:
-        """ Overrides PTL method to account for model parallel checkpoints.
-            Updates checkpoint path based on model parallel rank.
-        """
-        app_state = AppState()
-        if app_state.model_parallel_size is not None:
-            # filepath needs to be updated to include mp_rank
-            # TODO: figure out a good way to update these filepaths
-            dirname = os.path.dirname(filepath)
-            basename = os.path.basename(filepath)
-            filepath = f'{dirname}/mp_rank_{app_state.model_parallel_rank:02d}/{basename}'
-
-            # each model parallel rank needs to remove its model
-            if app_state.data_parallel_rank == 0:
-                if self._fs.exists(filepath):
-                    self._fs.rm(filepath)
-                    logging.info(f"Removed model parallel checkpoint: {filepath}")
-
-        else:
-            return super()._del_model(trainer, filepath)
-
    def _del_model_without_trainer(self, filepath: str) -> None:
        app_state = AppState()
        if app_state.model_parallel_size is not None:
@ -807,61 +812,6 @@ class NeMoModelCheckpoint(ModelCheckpoint):
            except:
                logging.info(f"Tried to remove checkpoint: {filepath} but failed.")

-    def _save_last_checkpoint(self, trainer: 'pl.Trainer', monitor_candidates: Dict[str, _METRIC]) -> None:
-        """ Overrides PTL method to account for model parallel checkpoints.
-            Checks for data parallel rank 0 rather than global rank 0.
-        """
-        app_state = AppState()
-        if app_state.model_parallel_size is not None:
-            if not self.save_last:
-                return
-
-            filepath = self._format_checkpoint_name(self.CHECKPOINT_NAME_LAST, monitor_candidates)
-            filepath = os.path.join(self.dirpath, f"{filepath}{self.FILE_EXTENSION}")
-
-            trainer.save_checkpoint(filepath)
-
-            # TODO: figure out where self.last_model_path is being set
-            if self.last_model_path is not None:
-                if 'mp_rank' in self.last_model_path:
-                    last_model_path = Path(self.last_model_path)
-                    last_model_path = last_model_path.parent.parent.joinpath(last_model_path.name)
-                    self.last_model_path = str(last_model_path)
-
-            # for model parallel we need to delete models for each model parallel rank
-            if self.last_model_path and self.last_model_path != filepath and app_state.data_parallel_rank == 0:
-                self._del_model(trainer, self.last_model_path)
-
-            self.last_model_path = filepath
-
-        else:
-            return super()._save_last_checkpoint(trainer, monitor_candidates)
-
-    def _save_none_monitor_checkpoint(self, trainer: 'pl.Trainer', monitor_candidates: Dict[str, _METRIC]) -> None:
-        """ Overrides PTL method to account for model parallel checkpoints.
-            Checks for data parallel rank 0 rather than global rank 0.
-        """
-        app_state = AppState()
-        if app_state.model_parallel_size is not None:
-            if self.monitor is not None or self.save_top_k == 0:
-                return
-
-            filepath = self._get_metric_interpolated_filepath_name(monitor_candidates, trainer)
-
-            trainer.save_checkpoint(filepath)
-
-            if (
-                self.save_top_k is None
-                and self.best_model_path
-                and self.best_model_path != filepath
-                and app_state.data_parallel_rank == 0
-            ):
-                self._del_model(trainer, self.best_model_path)
-
-            self.best_model_path = filepath
-        else:
-            return super()._save_none_monitor_checkpoint(trainer, monitor_candidates)
-

 def configure_checkpointing(
    trainer: 'pytorch_lightning.Trainer', log_dir: Path, name: str, resume: bool, params: 'DictConfig'
@ -924,6 +874,10 @@ def configure_checkpointing(

    checkpoint_callback = NeMoModelCheckpoint(n_resume=resume, **params)
    checkpoint_callback.last_model_path = trainer.checkpoint_connector.resume_checkpoint_path or ""
+    if params.model_parallel_size is not None:
+        checkpoint_callback.last_model_path = NeMoModelCheckpoint._uninject_mp_rank(
+            checkpoint_callback.last_model_path
+        )
    trainer.callbacks.append(checkpoint_callback)


--- a/reinstall.sh
+++ b/reinstall.sh
@ -10,6 +10,9 @@ echo 'Uninstalling stuff'
 ${PIP} uninstall -y nemo_toolkit
 ${PIP} uninstall -y sacrebleu

+# TODO: revert when 1.5.0 is out
+${PIP} uninstall -y pytorch-lightning
+
 # Kept for legacy purposes
 ${PIP} uninstall -y nemo_asr
 ${PIP} uninstall -y nemo_nlp
@ -19,6 +22,9 @@ ${PIP} uninstall -y nemo_cv

 ${PIP} install -U setuptools

+# TODO: revert when 1.5.0 is out
+${PIP} install pytorch-lightning==1.5.0rc0
+
 echo 'Installing nemo and nemo_text_processing'
 if [[ "$INSTALL_OPTION" == "dev" ]]; then
    ${PIP} install --editable ".[all]"
--- a/requirements/requirements_lightning.txt
+++ b/requirements/requirements_lightning.txt
@ -3,4 +3,5 @@ torchmetrics>=0.4.1rc0
 transformers>=4.0.1
 webdataset>=0.1.48,<=0.1.62
 omegaconf>=2.1.0
-hydra-core>=1.1.0
+hydra-core>=1.1.0
+pyyaml<6  # Pinned until omegaconf works with pyyaml>=6
--- a/tutorials/asr/Voice_Activity_Detection.ipynb
+++ b/tutorials/asr/Voice_Activity_Detection.ipynb
@ -1074,7 +1074,7 @@
   "source": [
    "## Adding evaluation metrics\n",
    "\n",
-    "Here is an example of how to use more metrics (e.g. from pytorch_lightning) to evaluate your result.\n",
+    "Here is an example of how to use more metrics (e.g. from torchmetrics) to evaluate your result.\n",
    "\n",
    "**Note:** If you would like to add metrics for training and testing, have a look at \n",
    "```python\n",
@ -1088,7 +1088,7 @@
   "metadata": {},
   "outputs": [],
   "source": [
-    "from pytorch_lightning.metrics.classification import ConfusionMatrix"
+    "from torchmetrics import ConfusionMatrix"
   ]
  },
  {
--- a/tutorials/nlp/Punctuation_and_Capitalization.ipynb
+++ b/tutorials/nlp/Punctuation_and_Capitalization.ipynb
@ -9,31 +9,6 @@
    "BRANCH = 'main'"
   ]
  },
-  "kernelspec": {
-   "name": "python3",
-   "display_name": "Python 3"
-  },
-  "accelerator": "GPU",
-  "pycharm": {
-   "stem_cell": {
-    "cell_type": "raw",
-    "source": [],
-    "metadata": {
-     "collapsed": false
-    }
-   }
-  }
- },
- "cells": [
-    {
-     "cell_type": "code",
-     "execution_count": null,
-     "metadata": {},
-     "outputs": [],
-     "source": [
-     "BRANCH = 'main'"
-     ]
-    },
  {
   "cell_type": "code",
   "execution_count": null,
@ -693,15 +668,13 @@
    "Alternatively you can pass text for restoring punctuation and capitalization as plain text. See help for parameters `--input_text` and `--output_text` of the script\n",
    "[punctuate_capitalize_infer.py](https://github.com/NVIDIA/NeMo/blob/stable/examples/nlp/token_classification/punctuate_capitalize_infer.py>).\n",
    "\n",
-    "The script [punctuate_capitalize_infer.py](https://github.com/NVIDIA/NeMo/blob/stable/examples/nlp/token_classification/punctuate_capitalize_infer.py>)\n",
-    "can restore punctuation and capitalization in a text of arbitrary length. Long sequences are split into segments\n",
+    "The script `examples/nlp/token_classification/punctuate_capitalize_infer.py` can restore punctuation and capitalization in a text of arbitrary length. Long sequences are split into segments\n",
    "`--max_seq_length - 2` tokens each. Each segment starts and ends with `[CLS]` and `[SEP]`\n",
    "tokens correspondingly. Every segment is offset to the previous one by `--step` tokens. For example, if\n",
    "every character is a token, `--max_seq_length=5`, `--step=2`, then text `\"hello\"` will be split into\n",
    "segments `[['[CLS]', 'h', 'e', 'l', '[SEP]'], ['[CLS]', 'l', 'l', 'o', '[SEP]']]`.\n",
    "\n",
-    "If segments overlap, then predicted probabilities for a token present in several segments are multiplied before\n",
-    "before selecting the best candidate.\n",
+    "If segments overlap, then predicted probabilities for a token present in several segments are multiplied before selecting the best candidate.\n",
    "\n",
    "Splitting leads to pour performance of a model near edges of segments. Use parameter `--margin` to discard `--margin`\n",
    "probabilities predicted for `--margin` tokens near segment edges. For example, if every character is a token, `--max_seq_length=5`, `--step=1`, `--margin=1`, then text `\"hello\"` will be split into segments `[['[CLS]', 'h', 'e', 'l', '[SEP]'], ['[CLS]', 'e', 'l', 'l', '[SEP]'], ['[CLS]', 'l', 'l', 'o', '[SEP]']]`. Before calculating actual predictions, probabilities for tokens marked by asterisk are removed: `[['[CLS]', 'h', 'e', 'l'*, '[SEP]'*], ['[CLS]'*, 'e'*, 'l', 'l'*, '[SEP]'*], ['[CLS]'*, 'l'*, 'l', 'o', '[SEP]']]`.\n",
@ -850,17 +823,6 @@
    "trainer = pl.Trainer(gpus=1, fast_dev_run=fast_dev_run)\n",
    "trainer.fit(pretrained_model)"
   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {
-    "colab": {},
-    "colab_type": "code",
-    "id": "l7A5FeiTl6Zd"
-   },
-   "outputs": [],
-   "source": []
  }
 ],
 "metadata": {
@ -888,7 +850,7 @@
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
-   "version": "3.8.8"
+   "version": "3.7.7"
  },
  "pycharm": {
   "stem_cell": {
--- a/tutorials/tts/Inference_DurationPitchControl.ipynb
+++ b/tutorials/tts/Inference_DurationPitchControl.ipynb
@ -490,7 +490,7 @@
 ],
 "metadata": {
  "kernelspec": {
-   "display_name": "Python 3",
+   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
@ -504,7 +504,7 @@
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
-   "version": "3.8.8"
+   "version": "3.8.10"
  }
 },
 "nbformat": 4,