SLURMEnvironment — PyTorch Lightning 2.5.1.post0 documentation
class lightning.pytorch.plugins.environments.SLURMEnvironment(auto_requeue=True, requeue_signal=None)
Bases: ClusterEnvironment
Cluster environment for training on a cluster managed by SLURM.
You can configure the main_address and main_port properties via the environment variables MASTER_ADDR and MASTER_PORT, respectively.
Parameters:
- auto_requeue (bool) – Whether automatic job resubmission is enabled. How and under which conditions a job gets rescheduled is determined by the owner of this plugin.
- requeue_signal (Optional[Signals]) – The signal that SLURM sends to indicate that the job should be requeued. Defaults to SIGUSR1 on Unix.
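To make the role of requeue_signal more concrete, here is a minimal sketch that installs a handler for such a signal using only Python's standard signal module. RequeueFlag and install_requeue_handler are hypothetical names that are not part of Lightning, and the handler only records that the signal arrived; what actually happens on requeue is decided by the owner of the plugin. The sketch assumes a Unix system and that it runs in the main thread.

```python
import signal


class RequeueFlag:
    """Hypothetical helper: remembers whether the requeue signal was received."""

    def __init__(self):
        self.requested = False

    def __call__(self, signum, frame):
        # Signal handlers should do as little as possible; just set a flag.
        self.requested = True


def install_requeue_handler(requeue_signal=None):
    """Register a RequeueFlag for the given signal and return it.

    Mirrors the documented default: SIGUSR1 when no signal is given (Unix only).
    """
    if requeue_signal is None:
        requeue_signal = signal.SIGUSR1
    flag = RequeueFlag()
    signal.signal(requeue_signal, flag)
    return flag
```

A training loop could poll flag.requested between steps and save its state before the job is rescheduled.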
static detect()
Returns True if the current process was launched on a SLURM cluster.
It is possible to use the SLURM scheduler to request resources and then launch processes manually using a different environment. For this, the user can set the job name in SLURM to ‘bash’ or ‘interactive’ (srun --job-name=interactive). This prevents SLURMEnvironment from being detected, so that another environment can be detected automatically.
Return type: bool
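The detection rule described above can be sketched as a small standalone function. This is an illustration under stated assumptions, not the library's code: it assumes that the presence of the SLURM_NTASKS variable indicates a SLURM launch and that the job name is read from SLURM_JOB_NAME. The function name looks_like_slurm_job is hypothetical.

```python
import os

# Job names that opt out of SLURM detection, as described in the text.
INTERACTIVE_JOB_NAMES = ("bash", "interactive")


def looks_like_slurm_job(env=None):
    """Return True if the environment looks like a non-interactive SLURM job.

    Assumption: SLURM_NTASKS marks a SLURM launch and SLURM_JOB_NAME holds
    the job name. Pass a dict as env to test without touching os.environ.
    """
    env = os.environ if env is None else env
    if "SLURM_NTASKS" not in env:
        return False
    return env.get("SLURM_JOB_NAME") not in INTERACTIVE_JOB_NAMES
```

With a job started via srun --job-name=interactive, this sketch returns False, leaving room for another environment to be selected.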
global_rank()
The rank (index) of the currently running process across all nodes and devices.
Return type: int
local_rank()
The rank (index) of the currently running process inside of the current node.
Return type: int
node_rank()
The rank (index) of the node on which the current process runs.
Return type: int
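As a rough illustration of the three ranks, the sketch below reads them from the variables that SLURM sets for each task. It assumes SLURM_PROCID, SLURM_LOCALID and SLURM_NODEID are the relevant variables; ProcessRanks and read_slurm_ranks are hypothetical names used only for this example.

```python
import os
from typing import Mapping, NamedTuple, Optional


class ProcessRanks(NamedTuple):
    global_rank: int  # index across all nodes and devices
    local_rank: int   # index within the current node
    node_rank: int    # index of the node itself


def read_slurm_ranks(env: Optional[Mapping[str, str]] = None) -> ProcessRanks:
    """Read the per-task rank variables (assumed names) from the environment."""
    env = os.environ if env is None else env
    return ProcessRanks(
        global_rank=int(env["SLURM_PROCID"]),
        local_rank=int(env["SLURM_LOCALID"]),
        node_rank=int(env["SLURM_NODEID"]),
    )
```

For example, the second task on the third node of a job with two tasks per node would typically see a global rank of 5, a local rank of 1 and a node rank of 2.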
static resolve_root_node_address(nodes)
SLURM supports several formats for specifying a node list. This function selects the first host name from:
- a space-separated list of host names, e.g., ‘host0 host1 host3’ yields ‘host0’ as the root
- a comma-separated list of host names, e.g., ‘host0,host1,host3’ yields ‘host0’ as the root
- the range notation with brackets, e.g., ‘host[5-9]’ yields ‘host5’ as the root
Return type: str
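The three formats listed above can be handled with two regular-expression substitutions followed by a split. The function below is a minimal sketch of that idea, not necessarily how the library implements it; first_host is a hypothetical name.

```python
import re


def first_host(nodes: str) -> str:
    """Return the first host name from a SLURM-style node list string."""
    # Collapse a bracketed range or list to its first element: host[5-9] -> host5
    nodes = re.sub(r"\[(.*?)[,-].*\]", r"\1", nodes)
    # Handle a bracket that holds a single value: host[7] -> host7
    nodes = re.sub(r"\[(.*?)\]", r"\1", nodes)
    # Take the first entry of a space-separated, then comma-separated, list
    return nodes.split(" ")[0].split(",")[0]
```

Called with the examples from the list, first_host('host0 host1 host3') and first_host('host0,host1,host3') both give 'host0', and first_host('host[5-9]') gives 'host5'.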
validate_settings(num_devices, num_nodes)
Validates settings configured in the script against the environment, and raises an exception if there is an inconsistency.
Return type: None
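A validation of this kind might look like the following sketch. It assumes the job's values are exposed through SLURM_NTASKS_PER_NODE and SLURM_NNODES and that a mismatch is reported as a ValueError; the function name check_against_slurm and the error messages are hypothetical.

```python
from typing import Mapping


def check_against_slurm(num_devices: int, num_nodes: int, env: Mapping[str, str]) -> None:
    """Raise ValueError if the script's settings disagree with the SLURM job.

    Variables that are not set are skipped rather than treated as errors.
    """
    ntasks_per_node = env.get("SLURM_NTASKS_PER_NODE")
    if ntasks_per_node is not None and int(ntasks_per_node) != num_devices:
        raise ValueError(
            f"Script requested {num_devices} devices per node, "
            f"but SLURM_NTASKS_PER_NODE={ntasks_per_node}."
        )

    nnodes = env.get("SLURM_NNODES")
    if nnodes is not None and int(nnodes) != num_nodes:
        raise ValueError(
            f"Script requested {num_nodes} nodes, but SLURM_NNODES={nnodes}."
        )
```

Failing early like this surfaces a mismatch between the job script and the training script before any processes start waiting on each other.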
world_size()
The number of processes across all devices and nodes.
Return type: int
property creates_processes_externally: bool
Whether the environment creates the subprocesses or not.
property main_address: str
The main address through which all processes connect and communicate.
property main_port: int
An open and configured port on the main node through which all processes communicate.
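Since the main address and port can be set through MASTER_ADDR and MASTER_PORT, a lookup with explicit fallbacks can be sketched as below. The function name resolve_main_endpoint and both fallback values are hypothetical choices made for this example; they are not the defaults used by the library.

```python
import os
from typing import Mapping, Optional, Tuple


def resolve_main_endpoint(
    env: Optional[Mapping[str, str]] = None,
    fallback_address: str = "127.0.0.1",  # hypothetical fallback
    fallback_port: int = 12910,           # hypothetical fallback
) -> Tuple[str, int]:
    """Return (address, port), preferring MASTER_ADDR and MASTER_PORT if set."""
    env = os.environ if env is None else env
    address = env.get("MASTER_ADDR", fallback_address)
    port = int(env.get("MASTER_PORT", fallback_port))
    return address, port
```

Exporting MASTER_ADDR and MASTER_PORT in the job script before launching the processes is enough for a lookup like this to pick them up.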