SLURMEnvironment — PyTorch Lightning 2.5.1.post0 documentation

class lightning.pytorch.plugins.environments.SLURMEnvironment(auto_requeue=True, requeue_signal=None)

Bases: ClusterEnvironment

Cluster environment for training on a cluster managed by SLURM.

You can configure the main_address and main_port properties via the environment variables MASTER_ADDR and MASTER_PORT, respectively.

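Parameters:

auto_requeue (bool) – Whether automatic job resubmission is enabled. How and under which conditions a job gets rescheduled is determined by the owner of this plugin.

requeue_signal (Optional[Signals]) – The signal that SLURM will send to indicate that the job should be requeued. Defaults to SIGUSR1 on Unix.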

static detect()

Returns True if the current process was launched on a SLURM cluster.

It is possible to use the SLURM scheduler to request resources and then launch processes manually using a different environment. To do this, set the SLURM job name to ‘bash’ or ‘interactive’ (srun --job-name=interactive). SLURMEnvironment is then not detected, and another environment can be detected automatically.

Return type:

bool
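
The detection rule described above can be approximated as follows. This is a minimal sketch, not the library's code: it assumes a SLURM launch is recognizable by the SLURM_NTASKS variable and that the job name is exposed through SLURM_JOB_NAME. The helper name looks_like_slurm is hypothetical.

```python
import os
from typing import Mapping, Optional


def looks_like_slurm(env: Optional[Mapping[str, str]] = None) -> bool:
    """Hypothetical helper approximating the SLURM detection rule."""
    env = os.environ if env is None else env
    # Assumption: SLURM sets SLURM_NTASKS for the processes it launches.
    launched_by_slurm = "SLURM_NTASKS" in env
    # Job names 'bash' and 'interactive' opt out of detection.
    opted_out = env.get("SLURM_JOB_NAME") in ("bash", "interactive")
    return launched_by_slurm and not opted_out
```

Passing a plain dictionary instead of the real environment makes the rule easy to check without a cluster, for example looks_like_slurm({"SLURM_NTASKS": "4", "SLURM_JOB_NAME": "interactive"}) returns False.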

global_rank()

The rank (index) of the currently running process across all nodes and devices.

Return type:

int

local_rank()

The rank (index) of the currently running process inside of the current node.

Return type:

int

node_rank()

The rank (index) of the node on which the current process runs.

Return type:

int
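
To make the three ranks concrete, the sketch below reads them from the environment. It assumes the values come from the SLURM variables SLURM_PROCID, SLURM_LOCALID and SLURM_NODEID, which this page does not name explicitly. The function read_slurm_ranks is a hypothetical helper, not part of the class.

```python
import os
from typing import Mapping, Optional, Tuple


def read_slurm_ranks(env: Optional[Mapping[str, str]] = None) -> Tuple[int, int, int]:
    """Hypothetical helper: return (global_rank, local_rank, node_rank)."""
    env = os.environ if env is None else env
    # Assumption: SLURM exports one variable per rank for each task.
    global_rank = int(env["SLURM_PROCID"])   # index across all nodes and devices
    local_rank = int(env["SLURM_LOCALID"])   # index within the current node
    node_rank = int(env["SLURM_NODEID"])     # index of the current node
    return global_rank, local_rank, node_rank


# Illustrative values for the sixth task of a job with 2 nodes and 4 tasks per node.
example_env = {"SLURM_PROCID": "5", "SLURM_LOCALID": "1", "SLURM_NODEID": "1"}
ranks = read_slurm_ranks(example_env)  # (5, 1, 1)
```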

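static resolve_root_node_address(nodes)

SLURM supports several formats for specifying the nodes of a job.

This function selects the first host name from a space-separated list of host names (‘host0 host1 host3’ yields ‘host0’ as the root), a comma-separated list of host names (‘host0,host1,host3’ yields ‘host0’), or the range notation with brackets (‘host[5-9]’ yields ‘host5’).

Return type:

str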
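
Picking the first host name out of a SLURM node list can be sketched with two regular-expression substitutions and a split. This is an illustration under the assumption that node lists are space-separated, comma-separated, or use bracketed ranges. The name first_host is hypothetical and the code is not presented as the library's implementation.

```python
import re


def first_host(nodes: str) -> str:
    """Hypothetical helper: return the first host name in a SLURM node list."""
    # Collapse a bracketed range or list to its first element: 'host[5-9]' -> 'host5'.
    nodes = re.sub(r"\[(.*?)[,-].*\]", r"\1", nodes)
    # Collapse a bracket that holds a single index: 'host[3]' -> 'host3'.
    nodes = re.sub(r"\[(.*?)\]", r"\1", nodes)
    # Keep the first entry of a space- or comma-separated list.
    return nodes.split(" ")[0].split(",")[0]


first_host("host0 host1 host3")  # 'host0'
first_host("host0,host1,host3")  # 'host0'
first_host("host[5-9]")          # 'host5'
```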

validate_settings(num_devices, num_nodes)

Validates settings configured in the script against the environment, and raises an exception if there is an inconsistency.

Return type:

None
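
A consistency check of this kind might look like the sketch below. It assumes the script's values are compared against the SLURM variables SLURM_NTASKS_PER_NODE and SLURM_NNODES, and that a ValueError is raised on a mismatch. The page itself only says "an exception", so both the variable names and the exception type are assumptions, and check_against_slurm is a hypothetical name.

```python
import os
from typing import Mapping, Optional


def check_against_slurm(
    num_devices: int,
    num_nodes: int,
    env: Optional[Mapping[str, str]] = None,
) -> None:
    """Hypothetical helper: fail if the script and the SLURM allocation disagree."""
    env = os.environ if env is None else env
    # Assumption: tasks per node should equal the number of devices per node.
    ntasks_per_node = env.get("SLURM_NTASKS_PER_NODE")
    if ntasks_per_node is not None and int(ntasks_per_node) != num_devices:
        raise ValueError(
            f"Script requests {num_devices} devices per node, "
            f"but SLURM was configured with {ntasks_per_node} tasks per node."
        )
    # Assumption: the allocated node count should equal num_nodes.
    nnodes = env.get("SLURM_NNODES")
    if nnodes is not None and int(nnodes) != num_nodes:
        raise ValueError(
            f"Script requests {num_nodes} nodes, but SLURM allocated {nnodes}."
        )
```

Variables that are not set are skipped rather than treated as errors, so the check stays silent outside a SLURM job.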

world_size()

The number of processes across all devices and nodes.

Return type:

int

property creates_processes_externally: bool

Whether the environment creates the subprocesses or not.

property main_address: str

The main address through which all processes connect and communicate.

property main_port: int

An open and configured port in the main node through which all processes communicate.
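
Since main_address and main_port can be configured through MASTER_ADDR and MASTER_PORT, the override logic can be sketched as below. The function resolve_rendezvous and its default_address and default_port parameters are hypothetical. This page does not state which values are used when the variables are unset, so the caller supplies the fallbacks here.

```python
import os
from typing import Mapping, Optional, Tuple


def resolve_rendezvous(
    default_address: str,
    default_port: int,
    env: Optional[Mapping[str, str]] = None,
) -> Tuple[str, int]:
    """Hypothetical helper: prefer MASTER_ADDR / MASTER_PORT over the defaults."""
    env = os.environ if env is None else env
    # Environment variables take precedence when they are set.
    address = env.get("MASTER_ADDR", default_address)
    # The port arrives as a string from the environment, so convert it.
    port = int(env.get("MASTER_PORT", default_port))
    return address, port


# Illustrative values only.
resolve_rendezvous("127.0.0.1", 12910, {"MASTER_ADDR": "node01", "MASTER_PORT": "29500"})
# ('node01', 29500)
```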