Known Issues — NVIDIA NeMo Framework User Guide
We will release fixes for the following issues shortly:
- 25.02 Known Issues
- Automodel
* This is primarily a functional release; performance improvements are planned for future versions.
* For large models (e.g., > 40B) trained with FSDP2, checkpoint saving can take longer than expected.
* Support for long sequences is currently limited, especially for large models (> 30B).
* Models with external dependencies may fail to run if those dependencies are unavailable (e.g., a missing package leading to a failed import).
* A small percentage of models available via AutoModelForCausalLM may only support inference, and have training capabilities explicitly disabled.
* Support for FSDP2 with mixed-precision weights (e.g., FP8 + BF16) is scheduled for a future release.
- Support for Context Parallelism with sequence packing plus padding between sequences is currently broken (see issue #12174). Use the 24.12 container or upgrade to TE 2.0+ for working support. This will be fixed in a future release.
- MoE-based models exhibit training instability. Please continue to use the 24.12 container for MoE training until 25.02 is patched with the fix for MoE.
- In 24.12, NeMo switched from pytorch_lightning to lightning.pytorch. If you have custom code that imports pytorch_lightning, you should replace the import with lightning.pytorch. Failing to do so will result in an error that looks like this:
File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/utilities/model_helpers.py", line 42, in is_overridden
raise ValueError("Expected a parent")
ValueError: Expected a parent
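If you maintain custom code, the fix is usually just the import path; here is a minimal sketch (the callback and trainer settings are illustrative, not taken from NeMo):

```python
# Before (containers prior to 24.12): the standalone package
# import pytorch_lightning as pl
# from pytorch_lightning.callbacks import ModelCheckpoint

# After (24.12 and later): the same APIs under the unified lightning package
import lightning.pytorch as pl
from lightning.pytorch.callbacks import ModelCheckpoint

# Existing subclasses and callbacks keep working once the import path is updated.
checkpoint_cb = ModelCheckpoint(save_top_k=1, monitor="val_loss")
trainer = pl.Trainer(max_epochs=1, callbacks=[checkpoint_cb])
```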
- Similarly, when using a 24.12 container or later, if you run evaluations with the LM Evaluation Harness, be sure to upgrade the LM Evaluation Harness to a version that includes this commit. This can be done by following these install instructions. Failing to do so will result in an error that looks like this:
ValueError: You selected an invalid strategy name: strategy=<nemo.collections.nlp.parts.nlp_overrides.NLPDDPStrategy object at 0x1554480d2410>. It must be either a string or an instance of pytorch_lightning.strategies.Strategy. Example choices: auto, ddp, ddp_spawn, deepspeed, ... Find a complete list of options in our documentation at https://lightning.ai
- Restoring the model context for NeMo 2.0 checkpoints produced with the NeMo 24.09 container fails when building the OptimizerConfig class from the megatron.core.optimizer.optimizer_config module, because the overlap_grad_reduce and overlap_param_gather parameters were moved out of the config API in Megatron Core. The update_io_context.py script drops unknown parameters from the checkpoint context to make it compatible with the latest container.
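To illustrate the idea behind that script (this sketch is not the actual update_io_context.py interface, and the restored kwargs below are hypothetical), unknown parameters can be filtered against the fields that the current config dataclass still declares:

```python
import dataclasses

from megatron.core.optimizer import OptimizerConfig  # assumes Megatron Core is installed


def build_config_dropping_unknown(cls, saved_kwargs):
    """Construct a config dataclass, dropping kwargs it no longer accepts.

    Illustrative only; the real script rewrites the serialized checkpoint
    context rather than an in-memory dict.
    """
    valid = {f.name for f in dataclasses.fields(cls)}
    dropped = sorted(k for k in saved_kwargs if k not in valid)
    if dropped:
        print(f"Dropping parameters unknown to {cls.__name__}: {dropped}")
    return cls(**{k: v for k, v in saved_kwargs.items() if k in valid})


# Hypothetical kwargs restored from a 24.09-era checkpoint context:
restored = {"lr": 1e-4, "overlap_grad_reduce": True, "overlap_param_gather": True}
optimizer_config = build_config_dropping_unknown(OptimizerConfig, restored)
```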
- Griffin (NeMo 1.0) full fine-tuning has checkpoint loading issues: the state dicts of the provided checkpoint and the initialized model do not match. Please use the 24.07 container if you need this model.
- NeMo_Forced_Aligner_Tutorial.ipynb raises an AttributeError; please use the 24.09 container if you need this notebook.
- The Gemma 2 27B pretraining recipe requires at least 2 nodes; the recipe currently has the default number of nodes set to 1.
- The Megatron Core Distributed Optimizer currently lacks a memory-capacity optimization, resulting in higher model-state memory usage at small data-parallel sizes. We will include this optimization in the next patch.
- The overlap of the data-parallel parameter AllGather with optimizer.step (overlap_param_gather_with_optimizer=true) does not work with distributed checkpointing. Support for distributed checkpointing will be available in the next public release.
- Support for converting models from NeMo 2.0 to 1.0 is not yet available. This support will be needed to align models until NeMo Aligner natively supports 2.0.
- Transformer Engine changed the way metadata is stored in checkpoints after v1.10, which can cause checkpoint incompatibilities when using a Transformer Engine version later than v1.10 to load a checkpoint trained with an earlier version. Errors of this form look similar to the following:
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/checkpoint/default_planner.py", line 315, in create_default_local_load_plan
raise RuntimeError(f"Missing key in checkpoint state_dict: {fqn}.")
RuntimeError: Missing key in checkpoint state_dict: model.decoder.layers.self_attention.core_attention._extra_state/shard_0_24.
or
File "/opt/megatron-lm/megatron/core/dist_checkpointing/strategies/common.py", line 118, in load_sharded_object
raise CheckpointingException(err_msg) from e
megatron.core.dist_checkpointing.core.CheckpointingException: Object shard .../model.decoder.layers.self_attention.core_attention._extra_state/shard_0_4.pt not found
To work around this issue, set model.dist_ckpt_load_strictness=log_all when working with Transformer Engine v1.10 or higher. You can find the Transformer Engine versions present in each NeMo container on the Software Component Versions page.
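As a hedged illustration (assuming a Hydra/OmegaConf-style NeMo training config; the file name below is a placeholder), the same override can be applied programmatically instead of on the command line:

```python
from omegaconf import OmegaConf

# Minimal sketch: load the training config you already use and relax the
# distributed-checkpoint loading strictness before launching training.
cfg = OmegaConf.load("my_training_config.yaml")  # placeholder path
OmegaConf.update(cfg, "model.dist_ckpt_load_strictness", "log_all", force_add=True)

print(OmegaConf.to_yaml(cfg.model))
```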
- For data preparation of GPT models, use your own dataset or an online dataset legally approved by your organization.
- A race condition in the NeMo experiment manager can occur when multiple processes or threads attempt to access and modify shared resources simultaneously, leading to unpredictable behavior or errors.
- The Mistral and Mixtral tokenizers require a Hugging Face login.
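For example, you can authenticate before downloading these tokenizers either with the huggingface-cli login command or programmatically (the token value below is a placeholder):

```python
from huggingface_hub import login

# Log in to Hugging Face so gated tokenizers (e.g., Mistral/Mixtral) can be
# downloaded. Replace the placeholder with your own access token, or call
# login() with no arguments for an interactive prompt.
login(token="hf_your_token_here")
```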
- Exporting Gemma, Starcoder, and Falcon 7B models to TRT-LLM only works with a single GPU. Additionally, if you attempt to export with multiple GPUs, no descriptive error message is shown.
- The following notebooks have functional issues and will be fixed in the next release:
- ASR_with_NeMo.ipynb
- ASR_with_Subword_Tokenization.ipynb
- AudioTranslationSample.ipynb
- Megatron_Synthetic_Tabular_Data_Generation.ipynb
- SpellMapper_English_ASR_Customization.ipynb
- FastPitch_ChineseTTS_Training.ipynb
- NeVA Tutorial.ipynb
- Export
- Exporting Llama 70B with vLLM causes an out-of-memory issue; more time is needed for root-cause analysis.
- vLLM export does not support LoRA or P-tuning; however, LoRA support will be added in the next release.
- In-framework (PyTorch-level) deployment with 8 GPUs encounters an error; more time is needed to understand the cause.
- The query script at scripts/deploy/nlp/query.py fails with the error "An error occurred: 'output_generation_logits'" in the 24.12 container. This will be fixed in the next container release.
- Multimodal
  - LITA tutorial issue: the data preparation part of tutorials/multimodal/LITA_Tutorial.ipynb requires you to manually download the YouMakeup dataset instead of using the provided script.
  - Add the argument exp_manager.checkpoint_callback_params.save_nemo_on_train_end=True to the NeVA notebook pretraining section to ensure an end-to-end workflow (an illustrative example follows below).
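A hedged sketch of where that flag sits, assuming the notebook assembles an OmegaConf config for exp_manager (the surrounding structure here is a placeholder, not the notebook's actual defaults):

```python
from omegaconf import OmegaConf

# Placeholder exp_manager config; save_nemo_on_train_end is the relevant knob.
exp_manager_cfg = OmegaConf.create(
    {
        "checkpoint_callback_params": {
            # Write a .nemo checkpoint at the end of training so downstream
            # steps in the end-to-end workflow can consume it.
            "save_nemo_on_train_end": True,
        }
    }
)
print(OmegaConf.to_yaml(exp_manager_cfg))
```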
- ASR
  - Timestamp misalignment occurs in FastConformer ASR models when using the ASR decoder for diarization. Related issue: #8438.