# Multi-GPU Computing with SELF
SELF builds come with domain decomposition enabled by default, which allows you to run with distributed-memory parallelism using MPI. When you build SELF with GPU acceleration enabled, multi-GPU platforms are supported natively. The key requirement is a GPU-aware MPI installation (see GPU Aware MPI with ROCm and GPU Aware MPI with CUDA).
## Key Concepts:
- Process Affinity: Bind MPI processes to specific CPUs/cores and GPUs to optimize performance by reducing interconnect overhead.
- Sequential Rank Assignment: Assign MPI ranks such that ranks for a node are packed together before moving to the next node.
- Mapping Policies: Use `mpirun`'s mapping and binding options to control how MPI processes are distributed across nodes and resources.
When deploying SELF on multi-GPU platforms, each MPI rank is assigned to a single GPU. The GPU assignment algorithm is simple and is implemented in the `Init_DomainDecomposition` method defined in the `src/gpu/SELF_DomainDecomposition.f90` module. Each MPI rank queries HIP or CUDA for the number of GPU devices on its node, and the device id assigned to the rank is the MPI rank id modulo the number of devices.

If you are running on a cluster of multi-GPU nodes, this scheme assumes that all nodes have the same number of GPUs. Additionally, if you are explicitly setting your process affinity, you will want to assign MPI ranks so that each node is packed sequentially.
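The snippet below is a minimal sketch of that round-robin assignment, not the actual code in `SELF_DomainDecomposition.f90`; the variable names are invented, and the device count is hard-coded where the library would query HIP or CUDA for it.

```fortran
program device_assignment_sketch
  ! Illustrates how an MPI rank id maps to a GPU device id under the
  ! modulo rule described above.
  use mpi
  implicit none
  integer :: rank_id, n_devices, device_id, ierr

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank_id, ierr)

  n_devices = 4                        ! assumption: stand-in for a hipGetDeviceCount/cudaGetDeviceCount query
  device_id = mod(rank_id, n_devices)  ! ranks 0-3 -> devices 0-3, rank 4 -> device 0, ...

  print '(a,i0,a,i0)', 'rank ', rank_id, ' -> device ', device_id

  call MPI_Finalize(ierr)
end program device_assignment_sketch
```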
## `mpirun` Options for Sequential Affinity:

Use these options when launching your application with `mpirun`:
- `--map-by ppr:N:node`: Places `N` processes per node.
- `--bind-to core` or `--bind-to socket`: Binds processes to specific cores or sockets.
- `--rank-by slot` or `--rank-by node`: Determines how ranks are ordered within the mapping (`slot` packs consecutive ranks onto the same node; `node` assigns ranks round-robin across nodes).
### Example Command:
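A representative launch line (the executable name `./your_application` is a placeholder; this mirrors the validation command shown later in this section, without `--report-bindings`):

```
mpirun --map-by ppr:1:node --bind-to core --rank-by node -np <num_procs> ./your_application
```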
### Explanation of Options:

- `--map-by ppr:1:node`: Places 1 MPI process per node and packs the ranks sequentially across the allocation (use `ppr:N:node` to match `N` GPUs per node, as in the multi-GPU example below).
- `--bind-to core`: Binds each MPI process to a core, ensuring proper CPU affinity.
- `--rank-by node`: Determines how ranks are numbered; with one process per node, ranks are assigned sequentially across the nodes.
- `-np <num_procs>`: Specifies the total number of MPI processes.
### Example for Multi-GPU Nodes:
If each node has 4 GPUs and you want 4 MPI processes per node:
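A launch line along these lines should work (note the use of `--rank-by slot`, Open MPI's default ranking policy, which keeps consecutive ranks on the same node; `--rank-by node` would instead interleave ranks across nodes):

```
mpirun --map-by ppr:4:node --bind-to core --rank-by slot -np <num_procs> ./your_application
```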
This ensures:

- MPI ranks `0-3` are on Node 1 (bound to GPUs `0-3`).
- MPI ranks `4-7` are on Node 2 (bound to GPUs `0-3`), and so on.
To validate process affinity and report process bindings, both MPI and Slurm provide tools and options to display detailed information about how MPI ranks and processes are mapped to hardware resources (e.g., CPUs, GPUs).
## Validating process bindings with `mpirun`
Most MPI implementations have options to print detailed binding information.
### OpenMPI
OpenMPI provides options to report process bindings:
- Add `--report-bindings` to your `mpirun` command:

    ```
    mpirun --report-bindings --map-by ppr:1:node --bind-to core --rank-by node -np <num_procs> ./your_application
    ```
    Example output:
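    Each rank reports the node and the cores it is bound to. The exact format varies by Open MPI version, and the hostnames, PIDs, and masks below are illustrative rather than real output:

    ```
    [node01:12345] MCW rank 0 bound to socket 0[core 0[hwt 0-1]]: [BB/../../..][../../../..]
    [node01:12345] MCW rank 1 bound to socket 0[core 1[hwt 0-1]]: [../BB/../..][../../../..]
    ```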
- Use `--display-map` to visualize the mapping of ranks across nodes. The map is printed at startup and lists, for each node in the allocation, the ranks placed on it and their bindings.
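    For example, adding the flag to the launch line used above:

    ```
    mpirun --display-map --map-by ppr:1:node --bind-to core --rank-by node -np <num_procs> ./your_application
    ```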
### MPICH

MPICH can display binding information using the `MPICH_RANK_REORDER_DISPLAY` environment variable:
- Set the environment variable before running:
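    A minimal sketch (assuming the display is enabled by setting the variable to `1`; check your MPICH variant's documentation for the values it accepts):

    ```
    export MPICH_RANK_REORDER_DISPLAY=1
    mpirun -np <num_procs> ./your_application
    ```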
## Launching with Slurm:

For Slurm-managed clusters, the equivalent command is:

```
srun --ntasks-per-node=4 --cpus-per-task=<cpus_per_mpi_process> --gpus-per-task=1 --distribution=block:block ./your_application
```
This approach ensures proper resource packing and sequential affinity per node.
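The same packing carries over to a batch script. A minimal sketch for a two-node, four-GPU-per-node job (the node and GPU counts, and the script layout, are assumptions to adapt to your system):

```
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --gpus-per-node=4
#SBATCH --cpus-per-task=<cpus_per_mpi_process>

srun --gpus-per-task=1 --distribution=block:block ./your_application
```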
## Validating process bindings with Slurm
Slurm provides options to display detailed information about how tasks are distributed and bound.
### Job Execution Information

Add the `--cpu-bind` or `--gpu-bind` flags to `srun` to specify the binding and have it displayed: the `verbose` suboption makes Slurm print, for each task, the node it landed on and the CPU/GPU mask it was bound to.
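A sketch of such a launch (the particular bind types shown, `cores` and `closest`, are illustrative choices; the `verbose` prefix is what triggers the report):

```
srun --ntasks-per-node=4 --cpus-per-task=<cpus_per_mpi_process> --gpus-per-task=1 \
     --cpu-bind=verbose,cores --gpu-bind=verbose,closest ./your_application
```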
### Slurm Environment Variables

Slurm provides environment variables during job execution, which you can print from within your application to confirm where each rank landed; an example Fortran snippet is shown below.
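A minimal sketch (the variables queried, `SLURM_PROCID`, `SLURM_LOCALID`, `SLURM_NODEID`, and `ROCR_VISIBLE_DEVICES`/`CUDA_VISIBLE_DEVICES`, are standard Slurm and GPU-runtime variables, but the snippet itself is illustrative and not part of SELF):

```fortran
program print_slurm_binding
  ! Print a few binding-related environment variables that Slurm
  ! (and the GPU runtime) set for each task.
  implicit none
  character(len=64) :: procid, localid, nodeid, gpus

  call get_environment_variable('SLURM_PROCID', procid)       ! global task (rank) id
  call get_environment_variable('SLURM_LOCALID', localid)     ! task id within the node
  call get_environment_variable('SLURM_NODEID', nodeid)       ! node index in the allocation
  call get_environment_variable('ROCR_VISIBLE_DEVICES', gpus) ! use CUDA_VISIBLE_DEVICES on NVIDIA systems

  print '(8a)', 'SLURM_PROCID=', trim(procid), &
                ' SLURM_LOCALID=', trim(localid), &
                ' SLURM_NODEID=', trim(nodeid), &
                ' GPUs=', trim(gpus)
end program print_slurm_binding
```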
### Job Allocation Report

Run `scontrol show job <job_id>` to display task and resource binding for a running or completed job. Relevant fields in the output include:

- `TaskAffinity`
- `CpusPerTask`
- `TRES` (e.g., GPUs, memory)
- `Nodes` and `NodeList`
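For instance, to pull just those fields out of the (fairly long) report, something like the following can be used (the `-d`/`--details` flag, which adds per-node CPU ID information, and the exact field names in the filter are assumptions to check against your Slurm version):

```
scontrol show job -d <job_id> | grep -E 'TaskAffinity|CpusPerTask|TRES|NodeList|CPU_IDs'
```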
## System-Specific Considerations:

- `mpirun` Implementation: The specific MPI implementation (e.g., OpenMPI, MPICH) might have slightly different syntax or options. Check your implementation's documentation.
- Resource Manager Integration: If using a resource manager like Slurm, consider its process binding flags (e.g., `--distribution=block:block` or `--ntasks-per-node`).
- NUMA domains: When assigning process affinity, you should also consider the latency between CPUs and GPUs on your system. On AMD platforms, you can use `rocm-bandwidth-test` to report on your system's topology; on NVIDIA platforms, you can use `nvidia-smi topo -m`. Ideally, MPI ranks should be assigned to the NUMA domain closest to their assigned GPU. A quick check is sketched below.
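For example (a minimal sketch; check each tool's help output for the exact flags available on your system):

```
# NVIDIA: print the GPU/CPU topology and affinity matrix
nvidia-smi topo -m

# AMD: rocm-bandwidth-test reports device topology alongside its bandwidth measurements
rocm-bandwidth-test
```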