Merge r1.5.0 bugfixes and doc updates to main (#3093)
* update branch Signed-off-by: ericharper <complex451@gmail.com> * Fix quantization bug in Asr (#3062) Signed-off-by: smajumdar <titu1994@gmail.com> * Update reinstall and cherry-pick bignlp commits (#3065) * add ptl install to reinstall and update jenkins install Signed-off-by: ericharper <complex451@gmail.com> * Add a stateless timer to specify max_time per run instead of global m… (#3056) * Add a stateless timer to specify max_time per run instead of global max_time across runs Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca> * Style fixes Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca> * (1) reduce the validation loss within a epoch, (2) convert global-batch-based iteartion counts to micro-batch-based (#3055) Co-authored-by: Oleksii Kuchaiev <okuchaiev@users.noreply.github.com> * Timer class monitors total time (train + validation + testing) to monitor when to end training (#3061) * Check total time in train/validation to exit Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca> * Style fixes Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca> Co-authored-by: Oleksii Kuchaiev <okuchaiev@users.noreply.github.com> * Add PUBLICATIONS.md (#3051) * Add PUBLICATIONS.md Signed-off-by: smajumdar <titu1994@gmail.com> * Add NLP Signed-off-by: smajumdar <titu1994@gmail.com> * Update PUBLICATIONS.md * Update PUBLICATIONS.md * Fix links Signed-off-by: smajumdar <titu1994@gmail.com> Co-authored-by: Eric Harper <complex451@gmail.com> * fix uninstall Signed-off-by: ericharper <complex451@gmail.com> Co-authored-by: Sandeep Subramanian <sandeep.subramanian.1@umontreal.ca> Co-authored-by: Sangkug Lym <slym@nvidia.com> Co-authored-by: Oleksii Kuchaiev <okuchaiev@users.noreply.github.com> Co-authored-by: Somshubra Majumdar <titu1994@gmail.com> * fix File Load Error (#3069) Signed-off-by: fayejf <fayejf07@gmail.com> Co-authored-by: Eric Harper <complex451@gmail.com> * Update hyper parameter saving (#3058) Signed-off-by: smajumdar <titu1994@gmail.com> Co-authored-by: Eric Harper <complex451@gmail.com> * Exp manager small refactor (#3067) * Exp manager small refactor Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca> * move super() call earlier in the function Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca> Co-authored-by: Somshubra Majumdar <titu1994@gmail.com> * Fix FastPitch Pitch Duration Notebook (#3068) * bugfix Signed-off-by: Jason <jasoli@nvidia.com> * bugfix2 Signed-off-by: Jason <jasoli@nvidia.com> * better check Signed-off-by: Jason <jasoli@nvidia.com> * confusionmatrix (#3085) Signed-off-by: fayejf <fayejf07@gmail.com> * typo and fix link (#3086) Signed-off-by: ekmb <ebakhturina@nvidia.com> * inf cross-checking across tensor-parallel ranks (#3088) * inf cross-checking across tensor-parallel ranks * sylte Signed-off-by: ericharper <complex451@gmail.com> Co-authored-by: Eric Harper <complex451@gmail.com> * Fix save top k (#3075) * inject mp_rank for checkpoint paths in NLPDDPPlugin Signed-off-by: ericharper <complex451@gmail.com> * == instead of i Signed-off-by: ericharper <complex451@gmail.com> * when checking previous run account for mp Signed-off-by: ericharper <complex451@gmail.com> * uninject mp ranks when needed Signed-off-by: ericharper <complex451@gmail.com> * style Signed-off-by: ericharper <complex451@gmail.com> * update branch Signed-off-by: ericharper <complex451@gmail.com> Co-authored-by: Somshubra Majumdar <titu1994@gmail.com> Co-authored-by: Sandeep Subramanian <sandeep.subramanian.1@umontreal.ca> Co-authored-by: Sangkug Lym <slym@nvidia.com> Co-authored-by: Oleksii Kuchaiev <okuchaiev@users.noreply.github.com> Co-authored-by: fayejf <36722593+fayejf@users.noreply.github.com> Co-authored-by: Jason <jasoli@nvidia.com> Co-authored-by: Evelina <10428420+ekmb@users.noreply.github.com>
This commit is contained in:
parent
648c97f076
commit
574b1014fd
13
Jenkinsfile
vendored
13
Jenkinsfile
vendored
|
@ -53,12 +53,6 @@ pipeline {
|
|||
}
|
||||
}
|
||||
|
||||
// Revert once import guards are added by PTL or version comparing is fixed
|
||||
stage('Install PyTorch Lighting 1.5 RC') {
|
||||
steps{
|
||||
sh 'pip install pytorch-lightning==1.5.0rc0 && sed -i "s/from pytorch_lightning.callbacks.quantization import QuantizationAwareTraining/try:\\n\\tfrom pytorch_lightning.callbacks.quantization import QuantizationAwareTraining\\nexcept:\\n\\tpass/g" /opt/conda/lib/python3.8/site-packages/pytorch_lightning/callbacks/__init__.py'
|
||||
}
|
||||
}
|
||||
|
||||
stage('NeMo Installation') {
|
||||
steps {
|
||||
|
@ -66,6 +60,13 @@ pipeline {
|
|||
}
|
||||
}
|
||||
|
||||
// Revert once import guards are added by PTL or version comparing is fixed
|
||||
stage('PTL Import Guards') {
|
||||
steps{
|
||||
sh 'sed -i "s/from pytorch_lightning.callbacks.quantization import QuantizationAwareTraining/try:\\n\\tfrom pytorch_lightning.callbacks.quantization import QuantizationAwareTraining\\nexcept:\\n\\tpass/g" /opt/conda/lib/python3.8/site-packages/pytorch_lightning/callbacks/__init__.py'
|
||||
}
|
||||
}
|
||||
|
||||
stage('PyTorch Lightning version') {
|
||||
steps {
|
||||
sh 'python -c "import pytorch_lightning; print(pytorch_lightning.__version__)"'
|
||||
|
|
|
@ -35,6 +35,7 @@ exp_manager:
|
|||
mode: min
|
||||
always_save_nemo: False # saves nemo file during validation, not implemented for model parallel
|
||||
filename: 'megatron_gpt--{val_loss:.2f}-{step}-{consumed_samples}'
|
||||
model_parallel_size: ${model.tensor_model_parallel_size}
|
||||
|
||||
|
||||
model:
|
||||
|
|
|
@ -17,11 +17,11 @@ from pathlib import Path
|
|||
from omegaconf.omegaconf import OmegaConf
|
||||
from pytorch_lightning import Trainer
|
||||
from pytorch_lightning.callbacks.timer import Timer
|
||||
from pytorch_lightning.trainer.connectors.checkpoint_connector import CheckpointConnector
|
||||
|
||||
from nemo.collections.nlp.models.language_modeling.megatron_gpt_model import MegatronGPTModel
|
||||
from nemo.collections.nlp.modules.common.megatron.megatron_utils import compute_model_parallel_rank
|
||||
from nemo.collections.nlp.parts.nlp_overrides import (
|
||||
NLPCheckpointConnector,
|
||||
NLPDDPPlugin,
|
||||
NLPNativeBfloat16PrecisionPlugin,
|
||||
NLPNativeMixedPrecisionPlugin,
|
||||
|
@ -69,7 +69,7 @@ def main(cfg) -> None:
|
|||
resume_from_checkpoint = str(resume_from_checkpoint)
|
||||
logging.info(f'Resuming training from checkpoint: {resume_from_checkpoint}')
|
||||
|
||||
trainer.checkpoint_connector = NLPCheckpointConnector(trainer, resume_from_checkpoint=resume_from_checkpoint)
|
||||
trainer.checkpoint_connector = CheckpointConnector(trainer, resume_from_checkpoint=resume_from_checkpoint)
|
||||
# Override timer callback to a stateless one
|
||||
for idx, callback in enumerate(trainer.callbacks):
|
||||
if isinstance(callback, Timer):
|
||||
|
|
|
@ -433,7 +433,7 @@ class SqueezeExcite(nn.Module):
|
|||
|
||||
if self.context_window < 0:
|
||||
if PYTORCH_QUANTIZATION_AVAILABLE and self._quantize:
|
||||
if not isinstance(self.pool, quant_nn.QuantAdaptiveAvgPool1d(1)):
|
||||
if not isinstance(self.pool, quant_nn.QuantAdaptiveAvgPool1d):
|
||||
self.pool = quant_nn.QuantAdaptiveAvgPool1d(1) # context window = T
|
||||
|
||||
elif not PYTORCH_QUANTIZATION_AVAILABLE and self._quantize:
|
||||
|
|
|
@ -15,6 +15,7 @@
|
|||
import os
|
||||
import shutil
|
||||
import tempfile
|
||||
from collections import defaultdict
|
||||
from typing import Any, Dict, List, Optional, Union
|
||||
|
||||
import pytorch_lightning as pl
|
||||
|
@ -30,12 +31,13 @@ from pytorch_lightning.trainer.trainer import Trainer
|
|||
from pytorch_lightning.utilities import rank_zero_warn
|
||||
from pytorch_lightning.utilities.cloud_io import atomic_save
|
||||
from pytorch_lightning.utilities.enums import GradClipAlgorithmType
|
||||
from pytorch_lightning.utilities.types import _PATH
|
||||
from torch.nn.modules.module import Module
|
||||
from torch.nn.parallel import DistributedDataParallel
|
||||
from torch.optim.optimizer import Optimizer
|
||||
|
||||
from nemo.core.connectors.save_restore_connector import SaveRestoreConnector
|
||||
from nemo.utils import AppState, logging
|
||||
from nemo.utils import AppState, app_state, logging
|
||||
|
||||
try:
|
||||
from apex.transformer import parallel_state
|
||||
|
@ -127,26 +129,38 @@ class NLPDDPPlugin(DDPPlugin):
|
|||
logging.info(f'mp_rank: {app_state.model_parallel_rank}')
|
||||
logging.info(f'dp_rank: {app_state.data_parallel_rank}')
|
||||
|
||||
def save_checkpoint(self, checkpoint: Dict[str, Any], filepath: str) -> None:
|
||||
"""Save model/training states as a checkpoint file through state-dump and file-write.
|
||||
def save_checkpoint(self, checkpoint: Dict[str, Any], filepath: _PATH) -> None:
|
||||
# PTL override to accomodate model parallel checkpoints
|
||||
filepath = self._inject_model_parallel_rank(filepath)
|
||||
return super().save_checkpoint(checkpoint, filepath)
|
||||
|
||||
Args:
|
||||
checkpoint: dict containing model and trainer state
|
||||
filepath: write-target file's path
|
||||
"""
|
||||
def remove_checkpoint(self, filepath: _PATH) -> None:
|
||||
# PTL override to accomodate model parallel checkpoints
|
||||
filepath = self._inject_model_parallel_rank(filepath)
|
||||
logging.info(f'Removing checkpoint: {filepath}')
|
||||
return super().remove_checkpoint(filepath)
|
||||
|
||||
def _inject_model_parallel_rank(self, filepath):
|
||||
app_state = AppState()
|
||||
# dump states as a checkpoint dictionary object
|
||||
# TrainingTypePlugin.on_save() just seems to return the same thing.
|
||||
# checkpoint = self.on_save(checkpoint)
|
||||
if self.is_global_zero or app_state.data_parallel_rank == 0:
|
||||
try:
|
||||
# write the checkpoint dictionary on the file
|
||||
atomic_save(checkpoint, filepath)
|
||||
except AttributeError as err:
|
||||
key = pl.LightningModule.CHECKPOINT_HYPER_PARAMS_KEY
|
||||
checkpoint.pop(key, None)
|
||||
rank_zero_warn(f"Warning, `{key}` dropped from checkpoint. An attribute is not picklable: {err}")
|
||||
atomic_save(checkpoint, filepath)
|
||||
# inserts mp_rank_XX for model parallel checkpoints
|
||||
if app_state.model_parallel_size is not None and app_state.model_parallel_size > 1:
|
||||
# filepath needs to be updated to include mp_rank
|
||||
dirname = os.path.dirname(filepath)
|
||||
basename = os.path.basename(filepath)
|
||||
filepath = f'{dirname}/mp_rank_{app_state.model_parallel_rank:02d}/{basename}'
|
||||
return filepath
|
||||
else:
|
||||
return filepath
|
||||
|
||||
@property
|
||||
def should_rank_save_checkpoint(self) -> bool:
|
||||
# PTL override that determines if checkpoints should be saved based on rank
|
||||
# for model parallel we need data_parallel_rank==0
|
||||
app_state = AppState()
|
||||
if app_state.model_parallel_size is not None and app_state.model_parallel_size > 1:
|
||||
return app_state.data_parallel_rank == 0
|
||||
else:
|
||||
return super().should_rank_save_checkpoint
|
||||
|
||||
@property
|
||||
def distributed_sampler_kwargs(self):
|
||||
|
@ -164,38 +178,6 @@ class NLPDDPPlugin(DDPPlugin):
|
|||
return super(NLPDDPPlugin, self).distributed_sampler_kwargs
|
||||
|
||||
|
||||
class NLPCheckpointConnector(CheckpointConnector):
|
||||
""" Override PTL CheckpointConnector to support model parallel checkpoints from Megatron-LM.
|
||||
"""
|
||||
|
||||
def __init__(self, trainer, resume_from_checkpoint):
|
||||
super().__init__(trainer, resume_from_checkpoint)
|
||||
if not HAVE_APEX:
|
||||
logging.warning("Apex was not found. Using model parallel or megatron models will error out.")
|
||||
|
||||
def save_checkpoint(self, filepath, weights_only: bool = False) -> None:
|
||||
"""Slightly modified version of PyTorch Lightning's save_checkpoint.
|
||||
Accounts for model parallel training.
|
||||
Save model/training states as a checkpoint file through state-dump and file-write.
|
||||
|
||||
Args:
|
||||
filepath: write-target file's path
|
||||
weights_only: saving model weights only
|
||||
"""
|
||||
app_state = AppState()
|
||||
if app_state.model_parallel_size is not None:
|
||||
# filepath needs to be updated to include mp_rank
|
||||
dirname = os.path.dirname(filepath)
|
||||
basename = os.path.basename(filepath)
|
||||
filepath = f'{dirname}/mp_rank_{app_state.model_parallel_rank:02d}/{basename}'
|
||||
_checkpoint = self.dump_checkpoint(weights_only)
|
||||
# each model parallel rank needs to save a copy of its model
|
||||
if app_state.data_parallel_rank == 0:
|
||||
self.trainer.accelerator.save_checkpoint(_checkpoint, filepath)
|
||||
else:
|
||||
super().save_checkpoint(filepath, weights_only)
|
||||
|
||||
|
||||
class NLPSaveRestoreConnector(SaveRestoreConnector):
|
||||
def __init__(self) -> None:
|
||||
super().__init__()
|
||||
|
@ -243,11 +225,117 @@ class NLPSaveRestoreConnector(SaveRestoreConnector):
|
|||
return super().save_to(model, save_path)
|
||||
|
||||
|
||||
class GradScaler(torch.cuda.amp.GradScaler):
|
||||
"""
|
||||
Gradient sclaer for model-parallel inf check. The inf in gradients are checked across tensor-parallel
|
||||
ranks in (1) executing optimizer step and (2) gradient scaler update.
|
||||
|
||||
"""
|
||||
|
||||
def __init__(
|
||||
self, init_scale=2.0 ** 16, growth_factor=2.0, backoff_factor=0.5, growth_interval=2000, enabled=True
|
||||
):
|
||||
super().__init__(
|
||||
init_scale=init_scale,
|
||||
growth_factor=growth_factor,
|
||||
backoff_factor=backoff_factor,
|
||||
growth_interval=growth_interval,
|
||||
enabled=enabled,
|
||||
)
|
||||
|
||||
def _maybe_opt_step(self, optimizer, optimizer_state, *args, **kwargs):
|
||||
retval = None
|
||||
found_inf = torch.cuda.FloatTensor([sum(v.item() for v in optimizer_state["found_inf_per_device"].values())])
|
||||
|
||||
# Update across all model parallel instances.
|
||||
torch.distributed.all_reduce(
|
||||
found_inf, op=torch.distributed.ReduceOp.MAX, group=parallel_state.get_model_parallel_group()
|
||||
)
|
||||
|
||||
if found_inf.item() == 0:
|
||||
retval = optimizer.step(*args, **kwargs)
|
||||
return retval
|
||||
|
||||
def update(self, new_scale=None):
|
||||
"""
|
||||
Updates the scale factor.
|
||||
|
||||
If any optimizer steps were skipped the scale is multiplied by ``backoff_factor``
|
||||
to reduce it. If ``growth_interval`` unskipped iterations occurred consecutively,
|
||||
the scale is multiplied by ``growth_factor`` to increase it.
|
||||
|
||||
Passing ``new_scale`` sets the new scale value manually. (``new_scale`` is not
|
||||
used directly, it's used to fill GradScaler's internal scale tensor. So if
|
||||
``new_scale`` was a tensor, later in-place changes to that tensor will not further
|
||||
affect the scale GradScaler uses internally.)
|
||||
|
||||
Args:
|
||||
new_scale (float or :class:`torch.cuda.FloatTensor`, optional, default=None): New scale factor.
|
||||
|
||||
.. warning::
|
||||
:meth:`update` should only be called at the end of the iteration, after ``scaler.step(optimizer)`` has
|
||||
been invoked for all optimizers used this iteration.
|
||||
"""
|
||||
if not self._enabled:
|
||||
return
|
||||
|
||||
_scale, _growth_tracker = self._check_scale_growth_tracker("update")
|
||||
|
||||
if new_scale is not None:
|
||||
# Accept a new user-defined scale.
|
||||
if isinstance(new_scale, float):
|
||||
self._scale.fill_(new_scale) # type: ignore[union-attr]
|
||||
else:
|
||||
reason = "new_scale should be a float or a 1-element torch.cuda.FloatTensor with requires_grad=False."
|
||||
assert isinstance(new_scale, torch.cuda.FloatTensor), reason # type: ignore[attr-defined]
|
||||
assert new_scale.numel() == 1, reason
|
||||
assert new_scale.requires_grad is False, reason
|
||||
self._scale.copy_(new_scale) # type: ignore[union-attr]
|
||||
else:
|
||||
# Consume shared inf/nan data collected from optimizers to update the scale.
|
||||
# If all found_inf tensors are on the same device as self._scale, this operation is asynchronous.
|
||||
found_infs = [
|
||||
found_inf.to(device=_scale.device, non_blocking=True)
|
||||
for state in self._per_optimizer_states.values()
|
||||
for found_inf in state["found_inf_per_device"].values()
|
||||
]
|
||||
|
||||
assert len(found_infs) > 0, "No inf checks were recorded prior to update."
|
||||
|
||||
found_inf_combined = found_infs[0]
|
||||
|
||||
# Update across all model parallel instances.
|
||||
torch.distributed.all_reduce(
|
||||
found_inf_combined, op=torch.distributed.ReduceOp.MAX, group=parallel_state.get_model_parallel_group()
|
||||
)
|
||||
|
||||
if len(found_infs) > 1:
|
||||
for i in range(1, len(found_infs)):
|
||||
found_inf = found_infs[i]
|
||||
# Update across all model parallel instances.
|
||||
torch.distributed.all_reduce(
|
||||
found_inf, op=torch.distributed.ReduceOp.MAX, group=parallel_state.get_model_parallel_group()
|
||||
)
|
||||
found_inf_combined += found_inf
|
||||
|
||||
torch._amp_update_scale_(
|
||||
_scale,
|
||||
_growth_tracker,
|
||||
found_inf_combined,
|
||||
self._growth_factor,
|
||||
self._backoff_factor,
|
||||
self._growth_interval,
|
||||
)
|
||||
|
||||
# To prepare for next iteration, clear the data collected from optimizers this iteration.
|
||||
self._per_optimizer_states = defaultdict(torch.cuda.amp.grad_scaler._refresh_per_optimizer_state)
|
||||
|
||||
|
||||
class NLPNativeMixedPrecisionPlugin(NativeMixedPrecisionPlugin):
|
||||
def __init__(self, init_scale: float = 2 ** 32, growth_interval: int = 1000) -> None:
|
||||
super().__init__(precision=16)
|
||||
|
||||
self.scaler = torch.cuda.amp.GradScaler(init_scale=init_scale, growth_interval=growth_interval)
|
||||
self.scaler = GradScaler(init_scale=init_scale, growth_interval=growth_interval)
|
||||
|
||||
def clip_gradients(
|
||||
self,
|
||||
|
|
|
@ -243,7 +243,8 @@ class FastPitchModule(NeuralModule):
|
|||
# Predict pitch
|
||||
pitch_predicted = self.pitch_predictor(enc_out, enc_mask)
|
||||
if pitch is not None:
|
||||
if self.learn_alignment:
|
||||
if self.learn_alignment and pitch.shape[-1] != pitch_predicted.shape[-1]:
|
||||
# Pitch during training is per spectrogram frame, but during inference, it should be per character
|
||||
pitch = average_pitch(pitch.unsqueeze(1), attn_hard_dur).squeeze(1)
|
||||
pitch_emb = self.pitch_emb(pitch.unsqueeze(1))
|
||||
else:
|
||||
|
|
|
@ -99,7 +99,7 @@ class ModelPT(LightningModule, Model):
|
|||
|
||||
self._cfg = cfg
|
||||
|
||||
self.save_hyperparameters(self._cfg)
|
||||
self.save_hyperparameters("cfg")
|
||||
self._train_dl = None
|
||||
self._validation_dl = None
|
||||
self._test_dl = None
|
||||
|
|
|
@ -80,6 +80,7 @@ class CallbackParams:
|
|||
postfix: str = ".nemo"
|
||||
save_best_model: bool = False
|
||||
always_save_nemo: bool = False
|
||||
model_parallel_size: Optional[int] = None
|
||||
|
||||
|
||||
@dataclass
|
||||
|
@ -115,6 +116,7 @@ class ExpManagerConfig:
|
|||
# logs timing of train/val/test steps
|
||||
log_step_timing: Optional[bool] = True
|
||||
step_timing_kwargs: Optional[StepTimingParams] = StepTimingParams()
|
||||
model_parallel_size: Optional[int] = None
|
||||
|
||||
|
||||
class TimingCallback(Callback):
|
||||
|
@ -652,12 +654,21 @@ class NeMoModelCheckpoint(ModelCheckpoint):
|
|||
""" Light wrapper around Lightning's ModelCheckpoint to force a saved checkpoint on train_end
|
||||
"""
|
||||
|
||||
def __init__(self, always_save_nemo=False, save_best_model=False, postfix=".nemo", n_resume=False, **kwargs):
|
||||
def __init__(
|
||||
self,
|
||||
always_save_nemo=False,
|
||||
save_best_model=False,
|
||||
postfix=".nemo",
|
||||
n_resume=False,
|
||||
model_parallel_size=None,
|
||||
**kwargs,
|
||||
):
|
||||
# Parse and store "extended" parameters: save_best model and postfix.
|
||||
self.always_save_nemo = always_save_nemo
|
||||
self.save_best_model = save_best_model
|
||||
self.postfix = postfix
|
||||
self.previous_best_path = ""
|
||||
self.model_parallel_size = model_parallel_size
|
||||
|
||||
# `prefix` is deprecated
|
||||
if 'prefix' in kwargs:
|
||||
|
@ -678,7 +689,6 @@ class NeMoModelCheckpoint(ModelCheckpoint):
|
|||
self.kth_best_model_path
|
||||
self.best_model_score
|
||||
self.best_model_path
|
||||
self._del_model
|
||||
except AttributeError:
|
||||
raise AttributeError("Lightning's ModelCheckpoint was updated. NeMoModelCheckpoint will need an update.")
|
||||
self.best_k_models = {}
|
||||
|
@ -688,6 +698,7 @@ class NeMoModelCheckpoint(ModelCheckpoint):
|
|||
|
||||
checkpoints = list(Path(self.dirpath).rglob("*.ckpt"))
|
||||
for checkpoint in checkpoints:
|
||||
checkpoint = self._uninject_mp_rank(checkpoint)
|
||||
checkpoint = str(checkpoint)
|
||||
if checkpoint[-10:] == '-last.ckpt':
|
||||
continue
|
||||
|
@ -706,7 +717,10 @@ class NeMoModelCheckpoint(ModelCheckpoint):
|
|||
|
||||
### This section should be ok as rank zero will delete all excess checkpoints, since all other ranks are
|
||||
### instantiated after rank zero. models_to_delete should be 0 for all other ranks.
|
||||
models_to_delete = len(best_k_models) - self.save_top_k
|
||||
if self.model_parallel_size is not None:
|
||||
models_to_delete = len(best_k_models) - self.model_parallel_size * self.save_top_k
|
||||
else:
|
||||
models_to_delete = len(best_k_models) - self.save_top_k
|
||||
logging.debug(f'Number of models to delete: {models_to_delete}')
|
||||
for _ in range(models_to_delete):
|
||||
model = best_k_models.pop(-1)
|
||||
|
@ -718,6 +732,17 @@ class NeMoModelCheckpoint(ModelCheckpoint):
|
|||
self.best_model_path = best_k_models[0]
|
||||
self.best_model_score = self.best_k_models[self.best_model_path]
|
||||
|
||||
# # uninject mp_rank from paths
|
||||
# self.kth_best_model_path = self._uninject_mp_rank(self.kth_best_model_path)
|
||||
# self.best_model_path = self._uninject_mp_rank(self.best_model_path)
|
||||
|
||||
@staticmethod
|
||||
def _uninject_mp_rank(filepath):
|
||||
dirname = os.path.dirname(os.path.dirname(filepath))
|
||||
basename = os.path.basename(filepath)
|
||||
filepath = os.path.join(dirname, basename)
|
||||
return filepath
|
||||
|
||||
def on_save_checkpoint(self, trainer, pl_module, checkpoint):
|
||||
output = super().on_save_checkpoint(trainer, pl_module, checkpoint)
|
||||
if not self.always_save_nemo:
|
||||
|
@ -756,41 +781,21 @@ class NeMoModelCheckpoint(ModelCheckpoint):
|
|||
if trainer.fast_dev_run:
|
||||
return None
|
||||
|
||||
monitor_candidates = self._monitor_candidates(trainer, trainer.current_epoch, trainer.global_step - 1)
|
||||
self._save_last_checkpoint(trainer, monitor_candidates=monitor_candidates)
|
||||
# Call parent on_train_end() to save the -last checkpoint
|
||||
super().on_train_end(trainer, pl_module)
|
||||
|
||||
# Load the best model and then re-save it
|
||||
if self.save_best_model:
|
||||
if self.best_model_path is "":
|
||||
if self.best_model_path == "":
|
||||
logging.warning(
|
||||
f"{self} was told to save the best checkpoint at the end of training, but no saved checkpoints "
|
||||
"were found. Saving latest model instead."
|
||||
)
|
||||
else:
|
||||
trainer.checkpoint_connector.restore(self.best_model_path)
|
||||
|
||||
pl_module.save_to(save_path=os.path.join(self.dirpath, self.prefix + self.postfix))
|
||||
|
||||
def _del_model(self, trainer: "pl.Trainer", filepath: str) -> None:
|
||||
""" Overrides PTL method to account for model parallel checkpoints.
|
||||
Updates checkpoint path based on model parallel rank.
|
||||
"""
|
||||
app_state = AppState()
|
||||
if app_state.model_parallel_size is not None:
|
||||
# filepath needs to be updated to include mp_rank
|
||||
# TODO: figure out a good way to update these filepaths
|
||||
dirname = os.path.dirname(filepath)
|
||||
basename = os.path.basename(filepath)
|
||||
filepath = f'{dirname}/mp_rank_{app_state.model_parallel_rank:02d}/{basename}'
|
||||
|
||||
# each model parallel rank needs to remove its model
|
||||
if app_state.data_parallel_rank == 0:
|
||||
if self._fs.exists(filepath):
|
||||
self._fs.rm(filepath)
|
||||
logging.info(f"Removed model parallel checkpoint: {filepath}")
|
||||
|
||||
else:
|
||||
return super()._del_model(trainer, filepath)
|
||||
|
||||
def _del_model_without_trainer(self, filepath: str) -> None:
|
||||
app_state = AppState()
|
||||
if app_state.model_parallel_size is not None:
|
||||
|
@ -807,61 +812,6 @@ class NeMoModelCheckpoint(ModelCheckpoint):
|
|||
except:
|
||||
logging.info(f"Tried to remove checkpoint: {filepath} but failed.")
|
||||
|
||||
def _save_last_checkpoint(self, trainer: 'pl.Trainer', monitor_candidates: Dict[str, _METRIC]) -> None:
|
||||
""" Overrides PTL method to account for model parallel checkpoints.
|
||||
Checks for data parallel rank 0 rather than global rank 0.
|
||||
"""
|
||||
app_state = AppState()
|
||||
if app_state.model_parallel_size is not None:
|
||||
if not self.save_last:
|
||||
return
|
||||
|
||||
filepath = self._format_checkpoint_name(self.CHECKPOINT_NAME_LAST, monitor_candidates)
|
||||
filepath = os.path.join(self.dirpath, f"{filepath}{self.FILE_EXTENSION}")
|
||||
|
||||
trainer.save_checkpoint(filepath)
|
||||
|
||||
# TODO: figure out where self.last_model_path is being set
|
||||
if self.last_model_path is not None:
|
||||
if 'mp_rank' in self.last_model_path:
|
||||
last_model_path = Path(self.last_model_path)
|
||||
last_model_path = last_model_path.parent.parent.joinpath(last_model_path.name)
|
||||
self.last_model_path = str(last_model_path)
|
||||
|
||||
# for model parallel we need to delete models for each model parallel rank
|
||||
if self.last_model_path and self.last_model_path != filepath and app_state.data_parallel_rank == 0:
|
||||
self._del_model(trainer, self.last_model_path)
|
||||
|
||||
self.last_model_path = filepath
|
||||
|
||||
else:
|
||||
return super()._save_last_checkpoint(trainer, monitor_candidates)
|
||||
|
||||
def _save_none_monitor_checkpoint(self, trainer: 'pl.Trainer', monitor_candidates: Dict[str, _METRIC]) -> None:
|
||||
""" Overrides PTL method to account for model parallel checkpoints.
|
||||
Checks for data parallel rank 0 rather than global rank 0.
|
||||
"""
|
||||
app_state = AppState()
|
||||
if app_state.model_parallel_size is not None:
|
||||
if self.monitor is not None or self.save_top_k == 0:
|
||||
return
|
||||
|
||||
filepath = self._get_metric_interpolated_filepath_name(monitor_candidates, trainer)
|
||||
|
||||
trainer.save_checkpoint(filepath)
|
||||
|
||||
if (
|
||||
self.save_top_k is None
|
||||
and self.best_model_path
|
||||
and self.best_model_path != filepath
|
||||
and app_state.data_parallel_rank == 0
|
||||
):
|
||||
self._del_model(trainer, self.best_model_path)
|
||||
|
||||
self.best_model_path = filepath
|
||||
else:
|
||||
return super()._save_none_monitor_checkpoint(trainer, monitor_candidates)
|
||||
|
||||
|
||||
def configure_checkpointing(
|
||||
trainer: 'pytorch_lightning.Trainer', log_dir: Path, name: str, resume: bool, params: 'DictConfig'
|
||||
|
@ -924,6 +874,10 @@ def configure_checkpointing(
|
|||
|
||||
checkpoint_callback = NeMoModelCheckpoint(n_resume=resume, **params)
|
||||
checkpoint_callback.last_model_path = trainer.checkpoint_connector.resume_checkpoint_path or ""
|
||||
if params.model_parallel_size is not None:
|
||||
checkpoint_callback.last_model_path = NeMoModelCheckpoint._uninject_mp_rank(
|
||||
checkpoint_callback.last_model_path
|
||||
)
|
||||
trainer.callbacks.append(checkpoint_callback)
|
||||
|
||||
|
||||
|
|
|
@ -10,6 +10,9 @@ echo 'Uninstalling stuff'
|
|||
${PIP} uninstall -y nemo_toolkit
|
||||
${PIP} uninstall -y sacrebleu
|
||||
|
||||
# TODO: revert when 1.5.0 is out
|
||||
${PIP} uninstall -y pytorch-lightning
|
||||
|
||||
# Kept for legacy purposes
|
||||
${PIP} uninstall -y nemo_asr
|
||||
${PIP} uninstall -y nemo_nlp
|
||||
|
@ -19,6 +22,9 @@ ${PIP} uninstall -y nemo_cv
|
|||
|
||||
${PIP} install -U setuptools
|
||||
|
||||
# TODO: revert when 1.5.0 is out
|
||||
${PIP} install pytorch-lightning==1.5.0rc0
|
||||
|
||||
echo 'Installing nemo and nemo_text_processing'
|
||||
if [[ "$INSTALL_OPTION" == "dev" ]]; then
|
||||
${PIP} install --editable ".[all]"
|
||||
|
|
|
@ -3,4 +3,5 @@ torchmetrics>=0.4.1rc0
|
|||
transformers>=4.0.1
|
||||
webdataset>=0.1.48,<=0.1.62
|
||||
omegaconf>=2.1.0
|
||||
hydra-core>=1.1.0
|
||||
hydra-core>=1.1.0
|
||||
pyyaml<6 # Pinned until omegaconf works with pyyaml>=6
|
|
@ -1074,7 +1074,7 @@
|
|||
"source": [
|
||||
"## Adding evaluation metrics\n",
|
||||
"\n",
|
||||
"Here is an example of how to use more metrics (e.g. from pytorch_lightning) to evaluate your result.\n",
|
||||
"Here is an example of how to use more metrics (e.g. from torchmetrics) to evaluate your result.\n",
|
||||
"\n",
|
||||
"**Note:** If you would like to add metrics for training and testing, have a look at \n",
|
||||
"```python\n",
|
||||
|
@ -1088,7 +1088,7 @@
|
|||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from pytorch_lightning.metrics.classification import ConfusionMatrix"
|
||||
"from torchmetrics import ConfusionMatrix"
|
||||
]
|
||||
},
|
||||
{
|
||||
|
|
|
@ -9,31 +9,6 @@
|
|||
"BRANCH = 'main'"
|
||||
]
|
||||
},
|
||||
"kernelspec": {
|
||||
"name": "python3",
|
||||
"display_name": "Python 3"
|
||||
},
|
||||
"accelerator": "GPU",
|
||||
"pycharm": {
|
||||
"stem_cell": {
|
||||
"cell_type": "raw",
|
||||
"source": [],
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
}
|
||||
}
|
||||
}
|
||||
},
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"BRANCH = 'main'"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
|
@ -693,15 +668,13 @@
|
|||
"Alternatively you can pass text for restoring punctuation and capitalization as plain text. See help for parameters `--input_text` and `--output_text` of the script\n",
|
||||
"[punctuate_capitalize_infer.py](https://github.com/NVIDIA/NeMo/blob/stable/examples/nlp/token_classification/punctuate_capitalize_infer.py>).\n",
|
||||
"\n",
|
||||
"The script [punctuate_capitalize_infer.py](https://github.com/NVIDIA/NeMo/blob/stable/examples/nlp/token_classification/punctuate_capitalize_infer.py>)\n",
|
||||
"can restore punctuation and capitalization in a text of arbitrary length. Long sequences are split into segments\n",
|
||||
"The script `examples/nlp/token_classification/punctuate_capitalize_infer.py` can restore punctuation and capitalization in a text of arbitrary length. Long sequences are split into segments\n",
|
||||
"`--max_seq_length - 2` tokens each. Each segment starts and ends with `[CLS]` and `[SEP]`\n",
|
||||
"tokens correspondingly. Every segment is offset to the previous one by `--step` tokens. For example, if\n",
|
||||
"every character is a token, `--max_seq_length=5`, `--step=2`, then text `\"hello\"` will be split into\n",
|
||||
"segments `[['[CLS]', 'h', 'e', 'l', '[SEP]'], ['[CLS]', 'l', 'l', 'o', '[SEP]']]`.\n",
|
||||
"\n",
|
||||
"If segments overlap, then predicted probabilities for a token present in several segments are multiplied before\n",
|
||||
"before selecting the best candidate.\n",
|
||||
"If segments overlap, then predicted probabilities for a token present in several segments are multiplied before selecting the best candidate.\n",
|
||||
"\n",
|
||||
"Splitting leads to pour performance of a model near edges of segments. Use parameter `--margin` to discard `--margin`\n",
|
||||
"probabilities predicted for `--margin` tokens near segment edges. For example, if every character is a token, `--max_seq_length=5`, `--step=1`, `--margin=1`, then text `\"hello\"` will be split into segments `[['[CLS]', 'h', 'e', 'l', '[SEP]'], ['[CLS]', 'e', 'l', 'l', '[SEP]'], ['[CLS]', 'l', 'l', 'o', '[SEP]']]`. Before calculating actual predictions, probabilities for tokens marked by asterisk are removed: `[['[CLS]', 'h', 'e', 'l'*, '[SEP]'*], ['[CLS]'*, 'e'*, 'l', 'l'*, '[SEP]'*], ['[CLS]'*, 'l'*, 'l', 'o', '[SEP]']]`.\n",
|
||||
|
@ -850,17 +823,6 @@
|
|||
"trainer = pl.Trainer(gpus=1, fast_dev_run=fast_dev_run)\n",
|
||||
"trainer.fit(pretrained_model)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"colab": {},
|
||||
"colab_type": "code",
|
||||
"id": "l7A5FeiTl6Zd"
|
||||
},
|
||||
"outputs": [],
|
||||
"source": []
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
|
@ -888,7 +850,7 @@
|
|||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.8.8"
|
||||
"version": "3.7.7"
|
||||
},
|
||||
"pycharm": {
|
||||
"stem_cell": {
|
||||
|
|
|
@ -490,7 +490,7 @@
|
|||
],
|
||||
"metadata": {
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3",
|
||||
"display_name": "Python 3 (ipykernel)",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
|
@ -504,7 +504,7 @@
|
|||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.8.8"
|
||||
"version": "3.8.10"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
|
|
Loading…
Reference in a new issue