Merge r1.5.0 bugfixes and doc updates to main (#3093)

* update branch

Signed-off-by: ericharper <complex451@gmail.com>

* Fix quantization bug in ASR (#3062)

Signed-off-by: smajumdar <titu1994@gmail.com>

* Update reinstall and cherry-pick bignlp commits (#3065)

* add ptl install to reinstall and update jenkins install

Signed-off-by: ericharper <complex451@gmail.com>

* Add a stateless timer to specify max_time per run instead of global max_time across runs (#3056)

* Add a stateless timer to specify max_time per run instead of global max_time across runs

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* Style fixes

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>
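
A minimal sketch of the idea, assuming PyTorch Lightning 1.5's Timer callback hooks (the class name and details here are illustrative, not necessarily identical to NeMo's implementation): a Timer that never persists elapsed time, so max_time bounds each run rather than the whole training job.

    from pytorch_lightning.callbacks.timer import Timer

    class StatelessTimer(Timer):
        """Timer whose elapsed time is not saved to or restored from checkpoints."""

        def on_save_checkpoint(self, trainer, pl_module, checkpoint):
            # Skip Timer's state dump so a resumed run starts its clock at zero.
            return

        def on_load_checkpoint(self, trainer, pl_module, callback_state):
            # Ignore any timer state stored in the checkpoint.
            return

The pretraining-script diff further down then swaps the default Timer entry in trainer.callbacks for the stateless variant.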

* (1) reduce the validation loss within an epoch, (2) convert global-batch-based iteration counts to micro-batch-based (#3055)

Co-authored-by: Oleksii Kuchaiev <okuchaiev@users.noreply.github.com>
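
For context, the bookkeeping behind this conversion (numbers and names are illustrative): one global batch corresponds to accumulate_grad_batches micro-batch iterations on each data-parallel rank, so any interval expressed in global batches is multiplied by that factor.

    micro_batch_size = 4
    data_parallel_size = 8
    accumulate_grad_batches = 2   # micro-batches accumulated per optimizer step

    # Samples consumed per global batch:
    global_batch_size = micro_batch_size * data_parallel_size * accumulate_grad_batches  # 64

    # An interval of 100 global batches therefore corresponds to
    # 100 * accumulate_grad_batches = 200 micro-batch iterations per rank.
    val_interval_micro = 100 * accumulate_grad_batches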

* Timer class monitors total time (train + validation + testing) to decide when to end training (#3061)

* Check total time in train/validation to exit

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* Style fixes

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

Co-authored-by: Oleksii Kuchaiev <okuchaiev@users.noreply.github.com>

* Add PUBLICATIONS.md (#3051)

* Add PUBLICATIONS.md

Signed-off-by: smajumdar <titu1994@gmail.com>

* Add NLP

Signed-off-by: smajumdar <titu1994@gmail.com>

* Update PUBLICATIONS.md

* Update PUBLICATIONS.md

* Fix links

Signed-off-by: smajumdar <titu1994@gmail.com>

Co-authored-by: Eric Harper <complex451@gmail.com>

* fix uninstall

Signed-off-by: ericharper <complex451@gmail.com>

Co-authored-by: Sandeep Subramanian <sandeep.subramanian.1@umontreal.ca>
Co-authored-by: Sangkug Lym <slym@nvidia.com>
Co-authored-by: Oleksii Kuchaiev <okuchaiev@users.noreply.github.com>
Co-authored-by: Somshubra Majumdar <titu1994@gmail.com>

* fix File Load Error (#3069)

Signed-off-by: fayejf <fayejf07@gmail.com>

Co-authored-by: Eric Harper <complex451@gmail.com>

* Update hyper parameter saving (#3058)

Signed-off-by: smajumdar <titu1994@gmail.com>

Co-authored-by: Eric Harper <complex451@gmail.com>

* Exp manager small refactor (#3067)

* Exp manager small refactor

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* move super() call earlier in the function

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

Co-authored-by: Somshubra Majumdar <titu1994@gmail.com>

* Fix FastPitch Pitch Duration Notebook (#3068)

* bugfix

Signed-off-by: Jason <jasoli@nvidia.com>

* bugfix2

Signed-off-by: Jason <jasoli@nvidia.com>

* better check

Signed-off-by: Jason <jasoli@nvidia.com>

* confusionmatrix (#3085)

Signed-off-by: fayejf <fayejf07@gmail.com>

* typo and fix link (#3086)

Signed-off-by: ekmb <ebakhturina@nvidia.com>

* inf cross-checking across tensor-parallel ranks (#3088)

* inf cross-checking across tensor-parallel ranks

* style

Signed-off-by: ericharper <complex451@gmail.com>

Co-authored-by: Eric Harper <complex451@gmail.com>
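
The core of the change, condensed from the GradScaler override in the diff below (assumes torch.distributed and Apex's parallel_state are initialized):

    import torch
    from apex.transformer import parallel_state

    def inf_found_anywhere(found_inf: torch.Tensor) -> bool:
        # Every tensor-parallel rank must agree on whether to skip the optimizer
        # step; a partial step would let the model shards diverge.
        torch.distributed.all_reduce(
            found_inf,
            op=torch.distributed.ReduceOp.MAX,
            group=parallel_state.get_model_parallel_group(),
        )
        return found_inf.item() > 0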

* Fix save top k (#3075)

* inject mp_rank for checkpoint paths in NLPDDPPlugin

Signed-off-by: ericharper <complex451@gmail.com>

* == instead of is

Signed-off-by: ericharper <complex451@gmail.com>

* when checking previous run account for mp

Signed-off-by: ericharper <complex451@gmail.com>

* uninject mp ranks when needed

Signed-off-by: ericharper <complex451@gmail.com>

* style

Signed-off-by: ericharper <complex451@gmail.com>
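
The path handling described here amounts to adding or stripping an mp_rank_XX directory component; a condensed sketch of the helpers that appear in the diff below (function names are illustrative):

    import os

    def inject_mp_rank(filepath: str, mp_rank: int) -> str:
        # checkpoints/step100.ckpt -> checkpoints/mp_rank_00/step100.ckpt
        return os.path.join(
            os.path.dirname(filepath), f"mp_rank_{mp_rank:02d}", os.path.basename(filepath)
        )

    def uninject_mp_rank(filepath: str) -> str:
        # checkpoints/mp_rank_00/step100.ckpt -> checkpoints/step100.ckpt
        return os.path.join(
            os.path.dirname(os.path.dirname(filepath)), os.path.basename(filepath)
        )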

* update branch

Signed-off-by: ericharper <complex451@gmail.com>

Co-authored-by: Somshubra Majumdar <titu1994@gmail.com>
Co-authored-by: Sandeep Subramanian <sandeep.subramanian.1@umontreal.ca>
Co-authored-by: Sangkug Lym <slym@nvidia.com>
Co-authored-by: Oleksii Kuchaiev <okuchaiev@users.noreply.github.com>
Co-authored-by: fayejf <36722593+fayejf@users.noreply.github.com>
Co-authored-by: Jason <jasoli@nvidia.com>
Co-authored-by: Evelina <10428420+ekmb@users.noreply.github.com>
Authored by Eric Harper on 2021-10-29 20:15:37 -06:00, committed by GitHub
parent 648c97f076
commit 574b1014fd
13 changed files with 205 additions and 191 deletions

Jenkinsfile

@ -53,12 +53,6 @@ pipeline {
}
}
// Revert once import guards are added by PTL or version comparing is fixed
stage('Install PyTorch Lighting 1.5 RC') {
steps{
sh 'pip install pytorch-lightning==1.5.0rc0 && sed -i "s/from pytorch_lightning.callbacks.quantization import QuantizationAwareTraining/try:\\n\\tfrom pytorch_lightning.callbacks.quantization import QuantizationAwareTraining\\nexcept:\\n\\tpass/g" /opt/conda/lib/python3.8/site-packages/pytorch_lightning/callbacks/__init__.py'
}
}
stage('NeMo Installation') {
steps {
@ -66,6 +60,13 @@ pipeline {
}
}
// Revert once import guards are added by PTL or version comparing is fixed
stage('PTL Import Guards') {
steps{
sh 'sed -i "s/from pytorch_lightning.callbacks.quantization import QuantizationAwareTraining/try:\\n\\tfrom pytorch_lightning.callbacks.quantization import QuantizationAwareTraining\\nexcept:\\n\\tpass/g" /opt/conda/lib/python3.8/site-packages/pytorch_lightning/callbacks/__init__.py'
}
}
stage('PyTorch Lightning version') {
steps {
sh 'python -c "import pytorch_lightning; print(pytorch_lightning.__version__)"'


@ -35,6 +35,7 @@ exp_manager:
mode: min
always_save_nemo: False # saves nemo file during validation, not implemented for model parallel
filename: 'megatron_gpt--{val_loss:.2f}-{step}-{consumed_samples}'
model_parallel_size: ${model.tensor_model_parallel_size}
model:
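
The added value is an OmegaConf interpolation resolved from the config root; a small, hedged illustration of how it resolves:

    from omegaconf import OmegaConf

    cfg = OmegaConf.create({
        "model": {"tensor_model_parallel_size": 2},
        "exp_manager": {
            "checkpoint_callback_params": {
                "model_parallel_size": "${model.tensor_model_parallel_size}",
            }
        },
    })
    # Interpolations resolve against the config root on access.
    assert cfg.exp_manager.checkpoint_callback_params.model_parallel_size == 2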


@ -17,11 +17,11 @@ from pathlib import Path
from omegaconf.omegaconf import OmegaConf
from pytorch_lightning import Trainer
from pytorch_lightning.callbacks.timer import Timer
from pytorch_lightning.trainer.connectors.checkpoint_connector import CheckpointConnector
from nemo.collections.nlp.models.language_modeling.megatron_gpt_model import MegatronGPTModel
from nemo.collections.nlp.modules.common.megatron.megatron_utils import compute_model_parallel_rank
from nemo.collections.nlp.parts.nlp_overrides import (
NLPCheckpointConnector,
NLPDDPPlugin,
NLPNativeBfloat16PrecisionPlugin,
NLPNativeMixedPrecisionPlugin,
@ -69,7 +69,7 @@ def main(cfg) -> None:
resume_from_checkpoint = str(resume_from_checkpoint)
logging.info(f'Resuming training from checkpoint: {resume_from_checkpoint}')
trainer.checkpoint_connector = NLPCheckpointConnector(trainer, resume_from_checkpoint=resume_from_checkpoint)
trainer.checkpoint_connector = CheckpointConnector(trainer, resume_from_checkpoint=resume_from_checkpoint)
# Override timer callback to a stateless one
for idx, callback in enumerate(trainer.callbacks):
if isinstance(callback, Timer):


@ -433,7 +433,7 @@ class SqueezeExcite(nn.Module):
if self.context_window < 0:
if PYTORCH_QUANTIZATION_AVAILABLE and self._quantize:
if not isinstance(self.pool, quant_nn.QuantAdaptiveAvgPool1d(1)):
if not isinstance(self.pool, quant_nn.QuantAdaptiveAvgPool1d):
self.pool = quant_nn.QuantAdaptiveAvgPool1d(1) # context window = T
elif not PYTORCH_QUANTIZATION_AVAILABLE and self._quantize:
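
The bug fixed above: isinstance() expects a type (or tuple of types) as its second argument, not an instance, so passing quant_nn.QuantAdaptiveAvgPool1d(1) would raise a TypeError instead of returning a boolean. A minimal reproduction:

    class Pool:
        def __init__(self, output_size):
            self.output_size = output_size

    p = Pool(1)
    print(isinstance(p, Pool))    # True -- a class, as in the fixed line
    # isinstance(p, Pool(1))      # TypeError: isinstance() arg 2 must be a type ...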


@ -15,6 +15,7 @@
import os
import shutil
import tempfile
from collections import defaultdict
from typing import Any, Dict, List, Optional, Union
import pytorch_lightning as pl
@ -30,12 +31,13 @@ from pytorch_lightning.trainer.trainer import Trainer
from pytorch_lightning.utilities import rank_zero_warn
from pytorch_lightning.utilities.cloud_io import atomic_save
from pytorch_lightning.utilities.enums import GradClipAlgorithmType
from pytorch_lightning.utilities.types import _PATH
from torch.nn.modules.module import Module
from torch.nn.parallel import DistributedDataParallel
from torch.optim.optimizer import Optimizer
from nemo.core.connectors.save_restore_connector import SaveRestoreConnector
from nemo.utils import AppState, logging
from nemo.utils import AppState, app_state, logging
try:
from apex.transformer import parallel_state
@ -127,26 +129,38 @@ class NLPDDPPlugin(DDPPlugin):
logging.info(f'mp_rank: {app_state.model_parallel_rank}')
logging.info(f'dp_rank: {app_state.data_parallel_rank}')
def save_checkpoint(self, checkpoint: Dict[str, Any], filepath: str) -> None:
"""Save model/training states as a checkpoint file through state-dump and file-write.
def save_checkpoint(self, checkpoint: Dict[str, Any], filepath: _PATH) -> None:
# PTL override to accommodate model parallel checkpoints
filepath = self._inject_model_parallel_rank(filepath)
return super().save_checkpoint(checkpoint, filepath)
Args:
checkpoint: dict containing model and trainer state
filepath: write-target file's path
"""
def remove_checkpoint(self, filepath: _PATH) -> None:
# PTL override to accommodate model parallel checkpoints
filepath = self._inject_model_parallel_rank(filepath)
logging.info(f'Removing checkpoint: {filepath}')
return super().remove_checkpoint(filepath)
def _inject_model_parallel_rank(self, filepath):
app_state = AppState()
# dump states as a checkpoint dictionary object
# TrainingTypePlugin.on_save() just seems to return the same thing.
# checkpoint = self.on_save(checkpoint)
if self.is_global_zero or app_state.data_parallel_rank == 0:
try:
# write the checkpoint dictionary on the file
atomic_save(checkpoint, filepath)
except AttributeError as err:
key = pl.LightningModule.CHECKPOINT_HYPER_PARAMS_KEY
checkpoint.pop(key, None)
rank_zero_warn(f"Warning, `{key}` dropped from checkpoint. An attribute is not picklable: {err}")
atomic_save(checkpoint, filepath)
# inserts mp_rank_XX for model parallel checkpoints
if app_state.model_parallel_size is not None and app_state.model_parallel_size > 1:
# filepath needs to be updated to include mp_rank
dirname = os.path.dirname(filepath)
basename = os.path.basename(filepath)
filepath = f'{dirname}/mp_rank_{app_state.model_parallel_rank:02d}/{basename}'
return filepath
else:
return filepath
@property
def should_rank_save_checkpoint(self) -> bool:
# PTL override that determines if checkpoints should be saved based on rank
# for model parallel we need data_parallel_rank==0
app_state = AppState()
if app_state.model_parallel_size is not None and app_state.model_parallel_size > 1:
return app_state.data_parallel_rank == 0
else:
return super().should_rank_save_checkpoint
@property
def distributed_sampler_kwargs(self):
@ -164,38 +178,6 @@ class NLPDDPPlugin(DDPPlugin):
return super(NLPDDPPlugin, self).distributed_sampler_kwargs
class NLPCheckpointConnector(CheckpointConnector):
""" Override PTL CheckpointConnector to support model parallel checkpoints from Megatron-LM.
"""
def __init__(self, trainer, resume_from_checkpoint):
super().__init__(trainer, resume_from_checkpoint)
if not HAVE_APEX:
logging.warning("Apex was not found. Using model parallel or megatron models will error out.")
def save_checkpoint(self, filepath, weights_only: bool = False) -> None:
"""Slightly modified version of PyTorch Lightning's save_checkpoint.
Accounts for model parallel training.
Save model/training states as a checkpoint file through state-dump and file-write.
Args:
filepath: write-target file's path
weights_only: saving model weights only
"""
app_state = AppState()
if app_state.model_parallel_size is not None:
# filepath needs to be updated to include mp_rank
dirname = os.path.dirname(filepath)
basename = os.path.basename(filepath)
filepath = f'{dirname}/mp_rank_{app_state.model_parallel_rank:02d}/{basename}'
_checkpoint = self.dump_checkpoint(weights_only)
# each model parallel rank needs to save a copy of its model
if app_state.data_parallel_rank == 0:
self.trainer.accelerator.save_checkpoint(_checkpoint, filepath)
else:
super().save_checkpoint(filepath, weights_only)
class NLPSaveRestoreConnector(SaveRestoreConnector):
def __init__(self) -> None:
super().__init__()
@ -243,11 +225,117 @@ class NLPSaveRestoreConnector(SaveRestoreConnector):
return super().save_to(model, save_path)
class GradScaler(torch.cuda.amp.GradScaler):
"""
Gradient scaler for model-parallel inf checks. Infs in gradients are checked across tensor-parallel
ranks when (1) executing the optimizer step and (2) updating the gradient scaler.
"""
def __init__(
self, init_scale=2.0 ** 16, growth_factor=2.0, backoff_factor=0.5, growth_interval=2000, enabled=True
):
super().__init__(
init_scale=init_scale,
growth_factor=growth_factor,
backoff_factor=backoff_factor,
growth_interval=growth_interval,
enabled=enabled,
)
def _maybe_opt_step(self, optimizer, optimizer_state, *args, **kwargs):
retval = None
found_inf = torch.cuda.FloatTensor([sum(v.item() for v in optimizer_state["found_inf_per_device"].values())])
# Update across all model parallel instances.
torch.distributed.all_reduce(
found_inf, op=torch.distributed.ReduceOp.MAX, group=parallel_state.get_model_parallel_group()
)
if found_inf.item() == 0:
retval = optimizer.step(*args, **kwargs)
return retval
def update(self, new_scale=None):
"""
Updates the scale factor.
If any optimizer steps were skipped the scale is multiplied by ``backoff_factor``
to reduce it. If ``growth_interval`` unskipped iterations occurred consecutively,
the scale is multiplied by ``growth_factor`` to increase it.
Passing ``new_scale`` sets the new scale value manually. (``new_scale`` is not
used directly, it's used to fill GradScaler's internal scale tensor. So if
``new_scale`` was a tensor, later in-place changes to that tensor will not further
affect the scale GradScaler uses internally.)
Args:
new_scale (float or :class:`torch.cuda.FloatTensor`, optional, default=None): New scale factor.
.. warning::
:meth:`update` should only be called at the end of the iteration, after ``scaler.step(optimizer)`` has
been invoked for all optimizers used this iteration.
"""
if not self._enabled:
return
_scale, _growth_tracker = self._check_scale_growth_tracker("update")
if new_scale is not None:
# Accept a new user-defined scale.
if isinstance(new_scale, float):
self._scale.fill_(new_scale) # type: ignore[union-attr]
else:
reason = "new_scale should be a float or a 1-element torch.cuda.FloatTensor with requires_grad=False."
assert isinstance(new_scale, torch.cuda.FloatTensor), reason # type: ignore[attr-defined]
assert new_scale.numel() == 1, reason
assert new_scale.requires_grad is False, reason
self._scale.copy_(new_scale) # type: ignore[union-attr]
else:
# Consume shared inf/nan data collected from optimizers to update the scale.
# If all found_inf tensors are on the same device as self._scale, this operation is asynchronous.
found_infs = [
found_inf.to(device=_scale.device, non_blocking=True)
for state in self._per_optimizer_states.values()
for found_inf in state["found_inf_per_device"].values()
]
assert len(found_infs) > 0, "No inf checks were recorded prior to update."
found_inf_combined = found_infs[0]
# Update across all model parallel instances.
torch.distributed.all_reduce(
found_inf_combined, op=torch.distributed.ReduceOp.MAX, group=parallel_state.get_model_parallel_group()
)
if len(found_infs) > 1:
for i in range(1, len(found_infs)):
found_inf = found_infs[i]
# Update across all model parallel instances.
torch.distributed.all_reduce(
found_inf, op=torch.distributed.ReduceOp.MAX, group=parallel_state.get_model_parallel_group()
)
found_inf_combined += found_inf
torch._amp_update_scale_(
_scale,
_growth_tracker,
found_inf_combined,
self._growth_factor,
self._backoff_factor,
self._growth_interval,
)
# To prepare for next iteration, clear the data collected from optimizers this iteration.
self._per_optimizer_states = defaultdict(torch.cuda.amp.grad_scaler._refresh_per_optimizer_state)
class NLPNativeMixedPrecisionPlugin(NativeMixedPrecisionPlugin):
def __init__(self, init_scale: float = 2 ** 32, growth_interval: int = 1000) -> None:
super().__init__(precision=16)
self.scaler = torch.cuda.amp.GradScaler(init_scale=init_scale, growth_interval=growth_interval)
self.scaler = GradScaler(init_scale=init_scale, growth_interval=growth_interval)
def clip_gradients(
self,


@ -243,7 +243,8 @@ class FastPitchModule(NeuralModule):
# Predict pitch
pitch_predicted = self.pitch_predictor(enc_out, enc_mask)
if pitch is not None:
if self.learn_alignment:
if self.learn_alignment and pitch.shape[-1] != pitch_predicted.shape[-1]:
# Pitch during training is per spectrogram frame, but during inference, it should be per character
pitch = average_pitch(pitch.unsqueeze(1), attn_hard_dur).squeeze(1)
pitch_emb = self.pitch_emb(pitch.unsqueeze(1))
else:


@ -99,7 +99,7 @@ class ModelPT(LightningModule, Model):
self._cfg = cfg
self.save_hyperparameters(self._cfg)
self.save_hyperparameters("cfg")
self._train_dl = None
self._validation_dl = None
self._test_dl = None


@ -80,6 +80,7 @@ class CallbackParams:
postfix: str = ".nemo"
save_best_model: bool = False
always_save_nemo: bool = False
model_parallel_size: Optional[int] = None
@dataclass
@ -115,6 +116,7 @@ class ExpManagerConfig:
# logs timing of train/val/test steps
log_step_timing: Optional[bool] = True
step_timing_kwargs: Optional[StepTimingParams] = StepTimingParams()
model_parallel_size: Optional[int] = None
class TimingCallback(Callback):
@ -652,12 +654,21 @@ class NeMoModelCheckpoint(ModelCheckpoint):
""" Light wrapper around Lightning's ModelCheckpoint to force a saved checkpoint on train_end
"""
def __init__(self, always_save_nemo=False, save_best_model=False, postfix=".nemo", n_resume=False, **kwargs):
def __init__(
self,
always_save_nemo=False,
save_best_model=False,
postfix=".nemo",
n_resume=False,
model_parallel_size=None,
**kwargs,
):
# Parse and store "extended" parameters: save_best model and postfix.
self.always_save_nemo = always_save_nemo
self.save_best_model = save_best_model
self.postfix = postfix
self.previous_best_path = ""
self.model_parallel_size = model_parallel_size
# `prefix` is deprecated
if 'prefix' in kwargs:
@ -678,7 +689,6 @@ class NeMoModelCheckpoint(ModelCheckpoint):
self.kth_best_model_path
self.best_model_score
self.best_model_path
self._del_model
except AttributeError:
raise AttributeError("Lightning's ModelCheckpoint was updated. NeMoModelCheckpoint will need an update.")
self.best_k_models = {}
@ -688,6 +698,7 @@ class NeMoModelCheckpoint(ModelCheckpoint):
checkpoints = list(Path(self.dirpath).rglob("*.ckpt"))
for checkpoint in checkpoints:
checkpoint = self._uninject_mp_rank(checkpoint)
checkpoint = str(checkpoint)
if checkpoint[-10:] == '-last.ckpt':
continue
@ -706,7 +717,10 @@ class NeMoModelCheckpoint(ModelCheckpoint):
### This section should be ok as rank zero will delete all excess checkpoints, since all other ranks are
### instantiated after rank zero. models_to_delete should be 0 for all other ranks.
models_to_delete = len(best_k_models) - self.save_top_k
if self.model_parallel_size is not None:
models_to_delete = len(best_k_models) - self.model_parallel_size * self.save_top_k
else:
models_to_delete = len(best_k_models) - self.save_top_k
logging.debug(f'Number of models to delete: {models_to_delete}')
for _ in range(models_to_delete):
model = best_k_models.pop(-1)
@ -718,6 +732,17 @@ class NeMoModelCheckpoint(ModelCheckpoint):
self.best_model_path = best_k_models[0]
self.best_model_score = self.best_k_models[self.best_model_path]
# # uninject mp_rank from paths
# self.kth_best_model_path = self._uninject_mp_rank(self.kth_best_model_path)
# self.best_model_path = self._uninject_mp_rank(self.best_model_path)
@staticmethod
def _uninject_mp_rank(filepath):
dirname = os.path.dirname(os.path.dirname(filepath))
basename = os.path.basename(filepath)
filepath = os.path.join(dirname, basename)
return filepath
def on_save_checkpoint(self, trainer, pl_module, checkpoint):
output = super().on_save_checkpoint(trainer, pl_module, checkpoint)
if not self.always_save_nemo:
@ -756,41 +781,21 @@ class NeMoModelCheckpoint(ModelCheckpoint):
if trainer.fast_dev_run:
return None
monitor_candidates = self._monitor_candidates(trainer, trainer.current_epoch, trainer.global_step - 1)
self._save_last_checkpoint(trainer, monitor_candidates=monitor_candidates)
# Call parent on_train_end() to save the -last checkpoint
super().on_train_end(trainer, pl_module)
# Load the best model and then re-save it
if self.save_best_model:
if self.best_model_path is "":
if self.best_model_path == "":
logging.warning(
f"{self} was told to save the best checkpoint at the end of training, but no saved checkpoints "
"were found. Saving latest model instead."
)
else:
trainer.checkpoint_connector.restore(self.best_model_path)
pl_module.save_to(save_path=os.path.join(self.dirpath, self.prefix + self.postfix))
def _del_model(self, trainer: "pl.Trainer", filepath: str) -> None:
""" Overrides PTL method to account for model parallel checkpoints.
Updates checkpoint path based on model parallel rank.
"""
app_state = AppState()
if app_state.model_parallel_size is not None:
# filepath needs to be updated to include mp_rank
# TODO: figure out a good way to update these filepaths
dirname = os.path.dirname(filepath)
basename = os.path.basename(filepath)
filepath = f'{dirname}/mp_rank_{app_state.model_parallel_rank:02d}/{basename}'
# each model parallel rank needs to remove its model
if app_state.data_parallel_rank == 0:
if self._fs.exists(filepath):
self._fs.rm(filepath)
logging.info(f"Removed model parallel checkpoint: {filepath}")
else:
return super()._del_model(trainer, filepath)
def _del_model_without_trainer(self, filepath: str) -> None:
app_state = AppState()
if app_state.model_parallel_size is not None:
@ -807,61 +812,6 @@ class NeMoModelCheckpoint(ModelCheckpoint):
except:
logging.info(f"Tried to remove checkpoint: {filepath} but failed.")
def _save_last_checkpoint(self, trainer: 'pl.Trainer', monitor_candidates: Dict[str, _METRIC]) -> None:
""" Overrides PTL method to account for model parallel checkpoints.
Checks for data parallel rank 0 rather than global rank 0.
"""
app_state = AppState()
if app_state.model_parallel_size is not None:
if not self.save_last:
return
filepath = self._format_checkpoint_name(self.CHECKPOINT_NAME_LAST, monitor_candidates)
filepath = os.path.join(self.dirpath, f"{filepath}{self.FILE_EXTENSION}")
trainer.save_checkpoint(filepath)
# TODO: figure out where self.last_model_path is being set
if self.last_model_path is not None:
if 'mp_rank' in self.last_model_path:
last_model_path = Path(self.last_model_path)
last_model_path = last_model_path.parent.parent.joinpath(last_model_path.name)
self.last_model_path = str(last_model_path)
# for model parallel we need to delete models for each model parallel rank
if self.last_model_path and self.last_model_path != filepath and app_state.data_parallel_rank == 0:
self._del_model(trainer, self.last_model_path)
self.last_model_path = filepath
else:
return super()._save_last_checkpoint(trainer, monitor_candidates)
def _save_none_monitor_checkpoint(self, trainer: 'pl.Trainer', monitor_candidates: Dict[str, _METRIC]) -> None:
""" Overrides PTL method to account for model parallel checkpoints.
Checks for data parallel rank 0 rather than global rank 0.
"""
app_state = AppState()
if app_state.model_parallel_size is not None:
if self.monitor is not None or self.save_top_k == 0:
return
filepath = self._get_metric_interpolated_filepath_name(monitor_candidates, trainer)
trainer.save_checkpoint(filepath)
if (
self.save_top_k is None
and self.best_model_path
and self.best_model_path != filepath
and app_state.data_parallel_rank == 0
):
self._del_model(trainer, self.best_model_path)
self.best_model_path = filepath
else:
return super()._save_none_monitor_checkpoint(trainer, monitor_candidates)
def configure_checkpointing(
trainer: 'pytorch_lightning.Trainer', log_dir: Path, name: str, resume: bool, params: 'DictConfig'
@ -924,6 +874,10 @@ def configure_checkpointing(
checkpoint_callback = NeMoModelCheckpoint(n_resume=resume, **params)
checkpoint_callback.last_model_path = trainer.checkpoint_connector.resume_checkpoint_path or ""
if params.model_parallel_size is not None:
checkpoint_callback.last_model_path = NeMoModelCheckpoint._uninject_mp_rank(
checkpoint_callback.last_model_path
)
trainer.callbacks.append(checkpoint_callback)


@ -10,6 +10,9 @@ echo 'Uninstalling stuff'
${PIP} uninstall -y nemo_toolkit
${PIP} uninstall -y sacrebleu
# TODO: revert when 1.5.0 is out
${PIP} uninstall -y pytorch-lightning
# Kept for legacy purposes
${PIP} uninstall -y nemo_asr
${PIP} uninstall -y nemo_nlp
@ -19,6 +22,9 @@ ${PIP} uninstall -y nemo_cv
${PIP} install -U setuptools
# TODO: revert when 1.5.0 is out
${PIP} install pytorch-lightning==1.5.0rc0
echo 'Installing nemo and nemo_text_processing'
if [[ "$INSTALL_OPTION" == "dev" ]]; then
${PIP} install --editable ".[all]"


@ -3,4 +3,5 @@ torchmetrics>=0.4.1rc0
transformers>=4.0.1
webdataset>=0.1.48,<=0.1.62
omegaconf>=2.1.0
hydra-core>=1.1.0
hydra-core>=1.1.0
pyyaml<6 # Pinned until omegaconf works with pyyaml>=6


@ -1074,7 +1074,7 @@
"source": [
"## Adding evaluation metrics\n",
"\n",
"Here is an example of how to use more metrics (e.g. from pytorch_lightning) to evaluate your result.\n",
"Here is an example of how to use more metrics (e.g. from torchmetrics) to evaluate your result.\n",
"\n",
"**Note:** If you would like to add metrics for training and testing, have a look at \n",
"```python\n",
@ -1088,7 +1088,7 @@
"metadata": {},
"outputs": [],
"source": [
"from pytorch_lightning.metrics.classification import ConfusionMatrix"
"from torchmetrics import ConfusionMatrix"
]
},
{
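
A hedged usage sketch for the replacement import (assuming the torchmetrics>=0.4 pin that appears in the requirements diff above):

    import torch
    from torchmetrics import ConfusionMatrix

    confmat = ConfusionMatrix(num_classes=3)
    preds = torch.tensor([0, 2, 1, 2])
    target = torch.tensor([0, 1, 1, 2])
    print(confmat(preds, target))  # 3x3 confusion-matrix tensor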


@ -9,31 +9,6 @@
"BRANCH = 'main'"
]
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3"
},
"accelerator": "GPU",
"pycharm": {
"stem_cell": {
"cell_type": "raw",
"source": [],
"metadata": {
"collapsed": false
}
}
}
},
"cells": [
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"BRANCH = 'main'"
]
},
{
"cell_type": "code",
"execution_count": null,
@ -693,15 +668,13 @@
"Alternatively you can pass text for restoring punctuation and capitalization as plain text. See help for parameters `--input_text` and `--output_text` of the script\n",
"[punctuate_capitalize_infer.py](https://github.com/NVIDIA/NeMo/blob/stable/examples/nlp/token_classification/punctuate_capitalize_infer.py>).\n",
"\n",
"The script [punctuate_capitalize_infer.py](https://github.com/NVIDIA/NeMo/blob/stable/examples/nlp/token_classification/punctuate_capitalize_infer.py>)\n",
"can restore punctuation and capitalization in a text of arbitrary length. Long sequences are split into segments\n",
"The script `examples/nlp/token_classification/punctuate_capitalize_infer.py` can restore punctuation and capitalization in a text of arbitrary length. Long sequences are split into segments\n",
"`--max_seq_length - 2` tokens each. Each segment starts and ends with `[CLS]` and `[SEP]`\n",
"tokens correspondingly. Every segment is offset to the previous one by `--step` tokens. For example, if\n",
"every character is a token, `--max_seq_length=5`, `--step=2`, then text `\"hello\"` will be split into\n",
"segments `[['[CLS]', 'h', 'e', 'l', '[SEP]'], ['[CLS]', 'l', 'l', 'o', '[SEP]']]`.\n",
"\n",
"If segments overlap, then predicted probabilities for a token present in several segments are multiplied before\n",
"before selecting the best candidate.\n",
"If segments overlap, then predicted probabilities for a token present in several segments are multiplied before selecting the best candidate.\n",
"\n",
"Splitting leads to pour performance of a model near edges of segments. Use parameter `--margin` to discard `--margin`\n",
"probabilities predicted for `--margin` tokens near segment edges. For example, if every character is a token, `--max_seq_length=5`, `--step=1`, `--margin=1`, then text `\"hello\"` will be split into segments `[['[CLS]', 'h', 'e', 'l', '[SEP]'], ['[CLS]', 'e', 'l', 'l', '[SEP]'], ['[CLS]', 'l', 'l', 'o', '[SEP]']]`. Before calculating actual predictions, probabilities for tokens marked by asterisk are removed: `[['[CLS]', 'h', 'e', 'l'*, '[SEP]'*], ['[CLS]'*, 'e'*, 'l', 'l'*, '[SEP]'*], ['[CLS]'*, 'l'*, 'l', 'o', '[SEP]']]`.\n",
@ -850,17 +823,6 @@
"trainer = pl.Trainer(gpus=1, fast_dev_run=fast_dev_run)\n",
"trainer.fit(pretrained_model)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "l7A5FeiTl6Zd"
},
"outputs": [],
"source": []
}
],
"metadata": {
@ -888,7 +850,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.8"
"version": "3.7.7"
},
"pycharm": {
"stem_cell": {


@ -490,7 +490,7 @@
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
@ -504,7 +504,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.8"
"version": "3.8.10"
}
},
"nbformat": 4,