Slurm and MPI example

Overview

RCC supports these MPI implementations:

  • IntelMPI
  • MVAPICH2
  • OpenMPI

Each MPI implementation usually has modules available for use with GCC, the Intel Compiler Suite, and PGI. For example, at the time of this writing the following MPI modules were available:

openmpi/1.6(default)
openmpi/1.6+intel-12.1
openmpi/1.6+pgi-2012
mvapich2/1.8(default)
mvapich2/1.8+intel-12.1
mvapich2/1.8+pgi-2012
mvapich2/1.8-gpudirect
mvapich2/1.8-gpudirect+intel-12.1
intelmpi/4.0
intelmpi/4.0+intel-12.1(default)
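
To pick a particular build, query the module system and load the combination you want before compiling. A minimal sketch, using module names from the list above (the exact versions available will change over time):

# list the OpenMPI builds currently installed
module avail openmpi

# load a specific MPI/compiler combination
module load mvapich2/1.8+intel-12.1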

MPI Implementation Notes

The MPI implementations differ in their options and features; notable differences are described below.

IntelMPI

IntelMPI selects the network communication fabric it uses through an environment variable:

I_MPI_FABRICS

During job launch the Slurm TaskProlog detects the network hardware and sets this variable appropriately. It will typically be set to shm:ofa, which makes IntelMPI use shared-memory communication first and ibverbs after that. If a job runs on a node without InfiniBand, the variable is set to shm, which uses shared memory only and limits IntelMPI to a single-node job; this is usually what is wanted on nodes without a high-speed interconnect. The variable can be overridden in the submission script if desired.
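
For example, to override the TaskProlog default and force shared-memory-only communication for a single-node IntelMPI job, export the variable in the submission script before launching the program. A minimal sketch, using the values described above:

# force shared-memory-only communication (single-node job),
# overriding whatever the TaskProlog selected
export I_MPI_FABRICS=shm
mpirun ./hello-mpi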

MVAPICH2

MVAPICH2 is compiled with the OFA-IB-CH3 interface. There is no support for running programs compiled with MVAPICH2 on loosely coupled nodes.

GPUDirect builds of MVAPICH2 with CUDA enabled are available for use on the GPU nodes. These builds are otherwise identical to the standard MVAPICH2 build.
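
To use a GPUDirect build, load the corresponding module in place of the standard one; a minimal sketch using the module names listed above (any GPU resources for the job are requested with the usual Slurm options and are not shown here):

# CUDA-enabled MVAPICH2 build for the GPU nodes
module load mvapich2/1.8-gpudirect
mpicc hello-mpi.c -o hello-mpi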

OpenMPI

Nothing at this time.

Example

Let’s look at an example MPI hello world program, hello-mpi.c, and walk through the steps needed to compile it and submit it to the queue:

#include <stdio.h>
#include <string.h>
#include <mpi.h>

int main(int argc, char *argv[], char *envp[]) {
  int numprocs, rank, namelen;
  char processor_name[MPI_MAX_PROCESSOR_NAME];

  MPI_Init(&argc, &argv);
  MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Get_processor_name(processor_name, &namelen);

  printf("Process %d on %s out of %d\n", rank, processor_name, numprocs);

  MPI_Finalize();
  return 0;
}

Place hello-mpi.c in your home directory, then compile it interactively by entering the following commands into the terminal:

module load openmpi
mpicc hello-mpi.c -o hello-mpi

In this case we are using the default version of the openmpi module, which uses the GCC compiler. It should be possible to build this example with any available MPI/compiler combination.
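
For instance, to build the same program against MVAPICH2 and the Intel compilers, swap the module before compiling; a minimal sketch using module names from the list above:

module load mvapich2/1.8+intel-12.1
mpicc hello-mpi.c -o hello-mpi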

hello-mpi.sbatch is a submission script that can be used to submit this program to the queue.

#!/bin/bash

# set the job name to hello-mpi
#SBATCH --job-name=hello-mpi

# send output to hello-mpi.out
#SBATCH --output=hello-mpi.out

# this job requests 2 nodes
#SBATCH --nodes=2

# this job requests exclusive access to the nodes it is given;
# this means it will be the only job running on those nodes
#SBATCH --exclusive

# --constraint=ib must be given to guarantee the job is allocated
# nodes with InfiniBand
#SBATCH --constraint=ib

# load the openmpi module
module load openmpi

# Run the program with mpirun. Notice that -n is not required; mpirun
# automatically determines how many processes to run from the Slurm options
mpirun ./hello-mpi

The inline comments describe what each line does, but it is important to point out three things that almost all MPI jobs have in common (see the sketch after this list for a way to inspect the resulting allocation):

  • --constraint=ib is given to guarantee that nodes with InfiniBand are allocated
  • --exclusive is given to guarantee this job will be the only job on the nodes
  • mpirun does not need to be given -n; all supported MPI environments automatically determine the proper layout from the Slurm options
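
If you want to confirm the layout that mpirun will pick up, the allocation is described by standard Slurm environment variables that can be printed from inside the job script. A minimal sketch (SLURM_NTASKS appears only when a task count option such as --ntasks or --ntasks-per-node was given):

# show what Slurm allocated for this job
echo "nodes allocated: $SLURM_JOB_NODELIST"
echo "number of nodes: $SLURM_JOB_NUM_NODES"
echo "number of tasks: $SLURM_NTASKS"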

You can submit this job with this command:

sbatch hello-mpi.sbatch

Here is example output of this program:

Process 4 on midway123 out of 32
Process 0 on midway123 out of 32
Process 1 on midway123 out of 32
Process 2 on midway123 out of 32
Process 5 on midway123 out of 32
Process 15 on midway123 out of 32
Process 12 on midway123 out of 32
Process 7 on midway123 out of 32
Process 9 on midway123 out of 32
Process 14 on midway123 out of 32
Process 8 on midway123 out of 32
Process 24 on midway124 out of 32
Process 10 on midway123 out of 32
Process 11 on midway123 out of 32
Process 3 on midway123 out of 32
Process 6 on midway123 out of 32
Process 13 on midway123 out of 32
Process 17 on midway124 out of 32
Process 20 on midway124 out of 32
Process 19 on midway124 out of 32
Process 25 on midway124 out of 32
Process 27 on midway124 out of 32
Process 26 on midway124 out of 32
Process 29 on midway124 out of 32
Process 28 on midway124 out of 32
Process 31 on midway124 out of 32
Process 30 on midway124 out of 32
Process 18 on midway124 out of 32
Process 22 on midway124 out of 32
Process 21 on midway124 out of 32
Process 23 on midway124 out of 32
Process 16 on midway124 out of 32

It is possible to affect the number of tasks run per node with the --ntasks-per-node option. Submitting the job like this:

sbatch --ntasks-per-node=1 hello-mpi.sbatch

Results in output like this:

Process 0 on midway123 out of 2
Process 1 on midway124 out of 2
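
The option can also be placed in the submission script itself instead of on the sbatch command line; a minimal sketch:

# inside hello-mpi.sbatch, alongside the other #SBATCH lines
#SBATCH --ntasks-per-node=1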

Advanced Usage

Both OpenMPI and IntelMPI can launch MPI programs directly with the Slurm command srun. This mode is not necessary for most jobs, but it may allow job launch options that would not otherwise be possible. For example, from a login node it is possible to launch the hello-mpi program built above with OpenMPI directly on a compute node with this command:

srun --constraint=ib -n16 --exclusive hello-mpi

For IntelMPI, it is necessary to set an environment variable for this to work:

export I_MPI_PMI_LIBRARY=/software/slurm-current-$DISTARCH/lib/libpmi.so
srun --constraint=ib -n16 --exclusive hello-mpi
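
The same direct-launch mode can also be used inside a submission script. A minimal sketch for the OpenMPI build of hello-mpi, reusing the options from the earlier example:

#!/bin/bash
#SBATCH --job-name=hello-mpi-srun
#SBATCH --output=hello-mpi-srun.out
#SBATCH --ntasks=16
#SBATCH --exclusive
#SBATCH --constraint=ib

module load openmpi

# launch the MPI ranks directly with srun instead of mpirun
srun ./hello-mpi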
