# Multi-GPU Computing with SELF
SELF builds come with domain decomposition enabled by default, which allows you to run with distributed-memory parallelism using MPI. When you build SELF with GPU acceleration enabled, multi-GPU platforms are supported natively. The key requirement is a GPU-aware MPI installation (see GPU Aware MPI with ROCm and GPU Aware MPI with CUDA).
## Key Concepts:
- Process Affinity: Bind MPI processes to specific CPUs/cores and GPUs to optimize performance by reducing interconnect overhead.
- Sequential Rank Assignment: Assign MPI ranks such that ranks for a node are packed together before moving to the next node.
- Mapping Policies: Use `mpirun`'s mapping and binding options to control how MPI processes are distributed across nodes and resources.
When deploying SELF on multi-GPU platforms, each MPI rank is assigned to a single GPU. The GPU assignment algorithm is simple and is implemented in the `Init_DomainDecomposition` method defined in the `src/gpu/SELF_DomainDecomposition.f90` module. Each MPI rank queries HIP or CUDA for the number of GPU devices on its node, and the device id assigned to the rank is the MPI rank id modulo the number of devices.

If you are running on a cluster of multi-GPU nodes, this scheme assumes that all nodes have the same number of GPUs. Additionally, if you are explicitly setting your process affinity, you will want to assign MPI ranks so that each node is packed sequentially.
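The snippet below is a minimal sketch of that round-robin assignment, not the actual code in `SELF_DomainDecomposition.f90`; the variable names are invented, and the device count is hard-coded where the library would query HIP or CUDA for it.

```fortran
program device_assignment_sketch
  ! Illustrates how an MPI rank id maps to a GPU device id under the
  ! modulo rule described above.
  use mpi
  implicit none
  integer :: rank_id, n_devices, device_id, ierr

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank_id, ierr)

  n_devices = 4                        ! assumption: stand-in for a hipGetDeviceCount/cudaGetDeviceCount query
  device_id = mod(rank_id, n_devices)  ! ranks 0-3 -> devices 0-3, rank 4 -> device 0, ...

  print '(a,i0,a,i0)', 'rank ', rank_id, ' -> device ', device_id

  call MPI_Finalize(ierr)
end program device_assignment_sketch
```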
## `mpirun` Options for Sequential Affinity:

Use these options when launching your application with `mpirun`:
- `--map-by ppr:N:node`: Places `N` processes per node.
- `--bind-to core` or `--bind-to socket`: Binds processes to specific cores or sockets.
- `--rank-by slot` or `--rank-by node`: Determines how ranks are ordered within the mapping (`slot` packs consecutive ranks onto the same node; `node` assigns ranks round-robin across nodes).
### Example Command:
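A representative launch line (the executable name `./your_application` is a placeholder; this mirrors the validation command shown later in this section, without `--report-bindings`):

```
mpirun --map-by ppr:1:node --bind-to core --rank-by node -np <num_procs> ./your_application
```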
### Explanation of Options:

- `--map-by ppr:1:node`: Places 1 MPI process per node and packs the ranks sequentially across the allocation (use `ppr:N:node` to match `N` GPUs per node, as in the multi-GPU example below).
- `--bind-to core`: Binds each MPI process to a core, ensuring proper CPU affinity.
- `--rank-by node`: Determines how ranks are numbered; with one process per node, ranks are assigned sequentially across the nodes.
- `-np <num_procs>`: Specifies the total number of MPI processes.
### Example for Multi-GPU Nodes:
If each node has 4 GPUs and you want 4 MPI processes per node:
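A launch line along these lines should work (note the use of `--rank-by slot`, Open MPI's default ranking policy, which keeps consecutive ranks on the same node; `--rank-by node` would instead interleave ranks across nodes):

```
mpirun --map-by ppr:4:node --bind-to core --rank-by slot -np <num_procs> ./your_application
```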
This ensures:

- MPI ranks `0-3` are on Node 1 (bound to GPUs `0-3`).
- MPI ranks `4-7` are on Node 2 (bound to GPUs `0-3`), and so on.
To validate process affinity and report process bindings, both MPI and Slurm provide tools and options to display detailed information about how MPI ranks and processes are mapped to hardware resources (e.g., CPUs, GPUs).
## Validating process bindings with `mpirun`
Most MPI implementations have options to print detailed binding information.
### OpenMPI
OpenMPI provides options to report process bindings:
- Add `--report-bindings` to your `mpirun` command:

    ```
    mpirun --report-bindings --map-by ppr:1:node --bind-to core --rank-by node -np <num_procs> ./your_application
    ```
    Example output:
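    Each rank reports the node and the cores it is bound to. The exact format varies by Open MPI version, and the hostnames, PIDs, and masks below are illustrative rather than real output:

    ```
    [node01:12345] MCW rank 0 bound to socket 0[core 0[hwt 0-1]]: [BB/../../..][../../../..]
    [node01:12345] MCW rank 1 bound to socket 0[core 1[hwt 0-1]]: [../BB/../..][../../../..]
    ```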
- Use `--display-map` to visualize the mapping of ranks across nodes. The map is printed at startup and lists, for each node in the allocation, the ranks placed on it and their bindings.
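    For example, adding the flag to the launch line used above:

    ```
    mpirun --display-map --map-by ppr:1:node --bind-to core --rank-by node -np <num_procs> ./your_application
    ```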
### MPICH

MPICH can display binding information using the `MPICH_RANK_REORDER_DISPLAY` environment variable:
- Set the environment variable before running:
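    A minimal sketch (assuming the display is enabled by setting the variable to `1`; check your MPICH variant's documentation for the values it accepts):

    ```
    export MPICH_RANK_REORDER_DISPLAY=1
    mpirun -np <num_procs> ./your_application
    ```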
## Launching with Slurm:

For Slurm-managed clusters, the equivalent command is:

```
srun --ntasks-per-node=4 --cpus-per-task=<cpus_per_mpi_process> --gpus-per-task=1 --distribution=block:block ./your_application
```
This approach ensures proper resource packing and sequential affinity per node.
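The same packing carries over to a batch script. A minimal sketch for a two-node, four-GPU-per-node job (the node and GPU counts, and the script layout, are assumptions to adapt to your system):

```
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --gpus-per-node=4
#SBATCH --cpus-per-task=<cpus_per_mpi_process>

srun --gpus-per-task=1 --distribution=block:block ./your_application
```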
## Validating process bindings with Slurm
Slurm provides options to display detailed information about how tasks are distributed and bound.
### Job Execution Information

Add the `--cpu-bind` or `--gpu-bind` flags to `srun` to specify the binding and have it displayed: the `verbose` suboption makes Slurm print, for each task, the node it landed on and the CPU/GPU mask it was bound to.
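A sketch of such a launch (the particular bind types shown, `cores` and `closest`, are illustrative choices; the `verbose` prefix is what triggers the report):

```
srun --ntasks-per-node=4 --cpus-per-task=<cpus_per_mpi_process> --gpus-per-task=1 \
     --cpu-bind=verbose,cores --gpu-bind=verbose,closest ./your_application
```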
### Slurm Environment Variables

Slurm provides environment variables during job execution, which you can print from within your application to confirm where each rank landed; an example Fortran snippet is shown below.
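A minimal sketch (the variables queried, `SLURM_PROCID`, `SLURM_LOCALID`, `SLURM_NODEID`, and `ROCR_VISIBLE_DEVICES`/`CUDA_VISIBLE_DEVICES`, are standard Slurm and GPU-runtime variables, but the snippet itself is illustrative and not part of SELF):

```fortran
program print_slurm_binding
  ! Print a few binding-related environment variables that Slurm
  ! (and the GPU runtime) set for each task.
  implicit none
  character(len=64) :: procid, localid, nodeid, gpus

  call get_environment_variable('SLURM_PROCID', procid)       ! global task (rank) id
  call get_environment_variable('SLURM_LOCALID', localid)     ! task id within the node
  call get_environment_variable('SLURM_NODEID', nodeid)       ! node index in the allocation
  call get_environment_variable('ROCR_VISIBLE_DEVICES', gpus) ! use CUDA_VISIBLE_DEVICES on NVIDIA systems

  print '(8a)', 'SLURM_PROCID=', trim(procid), &
                ' SLURM_LOCALID=', trim(localid), &
                ' SLURM_NODEID=', trim(nodeid), &
                ' GPUs=', trim(gpus)
end program print_slurm_binding
```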
### Job Allocation Report

Run `scontrol show job <job_id>` to display task and resource binding for a running or completed job. Relevant fields in the output include:

- `TaskAffinity`
- `CpusPerTask`
- `TRES` (e.g., GPUs, memory)
- `Nodes` and `NodeList`
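For instance, to pull just those fields out of the (fairly long) report, something like the following can be used (the `-d`/`--details` flag, which adds per-node CPU ID information, and the exact field names in the filter are assumptions to check against your Slurm version):

```
scontrol show job -d <job_id> | grep -E 'TaskAffinity|CpusPerTask|TRES|NodeList|CPU_IDs'
```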
## System-Specific Considerations:

- `mpirun` Implementation: The specific MPI implementation (e.g., OpenMPI, MPICH) might have slightly different syntax or options. Check your implementation's documentation.
- Resource Manager Integration: If using a resource manager like Slurm, consider its process binding flags (e.g., `--distribution=block:block` or `--ntasks-per-node`).
- NUMA domains: When assigning process affinity, you should also consider the latency between CPUs and GPUs on your system. On AMD platforms, you can use `rocm-bandwidth-test` to report on your system's topology; on NVIDIA platforms, you can use `nvidia-smi topo -m`. Ideally, MPI ranks should be assigned to the NUMA domain closest to their assigned GPU. A quick check is sketched below.
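For example (a minimal sketch; check each tool's help output for the exact flags available on your system):

```
# NVIDIA: print the GPU/CPU topology and affinity matrix
nvidia-smi topo -m

# AMD: rocm-bandwidth-test reports device topology alongside its bandwidth measurements
rocm-bandwidth-test
```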