MPI Learning Notes

  • message passing model
  • the Message Passing Interface (MPI) is the standard interface for this model
  • MPI itself is only the definition of an interface; implementations provide the actual library

Key concepts

  • communicator: a set of processes that have the ability to communicate with one another
  • each process has a rank; messages are addressed by rank plus a tag that labels the message
  • point-to-point communications | collective communications

Programming

Suppose we have a C source file mpi_hello_world.c (a minimal sketch of the file itself appears after the host_file notes below). Compile and run it with:

mpicc mpi_hello_world.c -o hello_world
mpirun -np 4 -f host_file hello_world // -np is the number of processes to launch
                                      // host_file lists the hosts in the cluster

host_file

Ailab1
Ailab2
Ailab3

If you do not want the processes spread evenly and instead want to allocate them by core count, give each host a slot count:
host_file

Ailab1:2
Ailab2:2
Ailab3:2

With this, Ailab1's two slots are filled first; once they are exhausted, the next host is used.
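For reference, a minimal sketch of what mpi_hello_world.c could look like (not necessarily the tutorial's exact file):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char** argv) {
    // Initialize the MPI environment
    MPI_Init(&argc, &argv);

    // Get the number of processes and the rank of this process
    int world_size, world_rank;
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    printf("Hello world from rank %d out of %d processes\n",
           world_rank, world_size);

    // Clean up the MPI environment
    MPI_Finalize();
    return 0;
}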

  • Sending and receiving are the two foundational concepts of MPI.
  • MPI allows senders and receivers to also specify message IDs with the message (known as tags)
Prototypes of send and receive:
MPI_Send(
    void* data,
    int count,            // sends exactly this many elements
    MPI_Datatype datatype,
    int destination,
    int tag,
    MPI_Comm communicator)

MPI_Recv(
    void* data,
    int count,            // receives at most this many elements
    MPI_Datatype datatype,
    int source,
    int tag,
    MPI_Comm communicator,
    MPI_Status* status)
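A minimal point-to-point sketch, assuming world_rank has already been obtained with MPI_Comm_rank and at least two processes are running; rank 0 sends one int to rank 1:

int number;
if (world_rank == 0) {
    number = -1;
    // send one int to rank 1, tag 0
    MPI_Send(&number, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
} else if (world_rank == 1) {
    // receive one int from rank 0, tag 0
    MPI_Recv(&number, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    printf("Process 1 received number %d from process 0\n", number);
}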
MPI datatypes

MPI datatype            C equivalent
MPI_SHORT               short int
MPI_INT                 int
MPI_LONG                long int
MPI_LONG_LONG           long long int
MPI_UNSIGNED_CHAR       unsigned char
MPI_UNSIGNED_SHORT      unsigned short int
MPI_UNSIGNED            unsigned int
MPI_UNSIGNED_LONG       unsigned long int
MPI_UNSIGNED_LONG_LONG  unsigned long long int
MPI_FLOAT               float
MPI_DOUBLE              double
MPI_LONG_DOUBLE         long double
MPI_BYTE                char

- You can also create your own MPI datatypes, as sketched below.
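A minimal sketch of that idea using MPI_Type_contiguous (vec3_type is an illustrative name):

// Treat three consecutive doubles (e.g. a 3D coordinate) as one element
MPI_Datatype vec3_type;
MPI_Type_contiguous(3, MPI_DOUBLE, &vec3_type);
MPI_Type_commit(&vec3_type);   // a datatype must be committed before use

// vec3_type can now be used in MPI_Send / MPI_Recv, with count equal
// to the number of 3-double vectors

MPI_Type_free(&vec3_type);     // release the datatype when no longer needed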

Dynamic message sizes

Use the status object:

    MPI_Status status;

Getting the number of elements received:

MPI_Get_count(&status, MPI_INT, &number_amount);  
MPI_Probe(
    int source,
    int tag,
    MPI_Comm comm,
    MPI_Status* status)

MPI_Probe behaves like MPI_Recv, except that it does not actually receive the data.
Receiving an array of dynamic size:

    MPI_Probe(0, 0, MPI_COMM_WORLD, &status);

    // When probe returns, the status object has the size and other
    // attributes of the incoming message. Get the message size
    MPI_Get_count(&status, MPI_INT, &number_amount);

    // Allocate a buffer to hold the incoming numbers
    int* number_buf = (int*)malloc(sizeof(int) * number_amount);

    // Now receive the message with the allocated buffer
    MPI_Recv(number_buf, number_amount, MPI_INT, 0, 0,
             MPI_COMM_WORLD, MPI_STATUS_IGNORE);
  • As an exercise, make a wrapper around MPI_Recv that uses MPI_Probe for any dynamic applications you might write. It makes the code look much nicer :-)
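One possible sketch of such a wrapper for int messages (the function name and the malloc'd-buffer convention are my own choices):

// Probe the incoming message, allocate a buffer of exactly the right size,
// receive into it, and report the element count to the caller (who frees it).
int* recv_dynamic_ints(int source, int tag, MPI_Comm comm, int* out_count) {
    MPI_Status status;
    MPI_Probe(source, tag, comm, &status);
    MPI_Get_count(&status, MPI_INT, out_count);

    int* buf = (int*)malloc(sizeof(int) * (*out_count));
    // Use the probed source/tag so the call also works with MPI_ANY_SOURCE
    MPI_Recv(buf, *out_count, MPI_INT, status.MPI_SOURCE, status.MPI_TAG,
             comm, MPI_STATUS_IGNORE);
    return buf;
}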
Beware of deadlock
  • MPI_Send may not return until the message has been received (or at least buffered); if every process sends first, the program can deadlock. The tutorial avoids this by having odd and even ranks act in opposite orders: odd ranks receive first and then send, while even ranks send first and then receive, as sketched below.
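A sketch of that odd/even ordering for a ring-style exchange (world_rank and world_size are assumed to be known; the neighbour arithmetic is illustrative):

int token;
int next = (world_rank + 1) % world_size;
int prev = (world_rank - 1 + world_size) % world_size;

if (world_rank % 2 == 0) {
    // even ranks: send first, then receive
    MPI_Send(&world_rank, 1, MPI_INT, next, 0, MPI_COMM_WORLD);
    MPI_Recv(&token, 1, MPI_INT, prev, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
} else {
    // odd ranks: receive first, then send
    MPI_Recv(&token, 1, MPI_INT, prev, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Send(&world_rank, 1, MPI_INT, next, 0, MPI_COMM_WORLD);
}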

MPI Broadcast and Collective Communication

  • a synchronization point: no process continues past it until every process in the communicator has reached it
  • the function used for synchronization:
MPI_Barrier(MPI_Comm communicator)


- broadcast: one process sends the same data to all processes in a communicator.

MPI_Bcast(
    void* data,
    int count,
    MPI_Datatype datatype,
    int root,
    MPI_Comm communicator)

Both the root and the receiving processes call MPI_Bcast; the root parameter tells each process which role it plays.
MPI_Bcast implementations typically use a tree-based broadcast algorithm rather than having the root send to each process one by one.
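A minimal usage sketch (every rank, including the root, makes the same call):

int data;
if (world_rank == 0) {
    data = 100;   // only the root has a meaningful value before the call
}
// after this call, every rank's data equals 100
MPI_Bcast(&data, 1, MPI_INT, 0, MPI_COMM_WORLD);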

MPI_Wtime(); // returns a wall-clock timestamp in seconds
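Combined with MPI_Barrier, MPI_Wtime gives a simple timing pattern; a sketch (total_time and data are illustrative names):

double total_time = 0.0;
int data = 0;                       // value to broadcast (illustrative)
MPI_Barrier(MPI_COMM_WORLD);        // make sure all ranks start together
total_time -= MPI_Wtime();

MPI_Bcast(&data, 1, MPI_INT, 0, MPI_COMM_WORLD);   // the operation being timed

MPI_Barrier(MPI_COMM_WORLD);        // wait until every rank has finished
total_time += MPI_Wtime();
// total_time now holds the elapsed wall-clock seconds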
  • MPI_Scatter is similar to MPI_Bcast; the difference is that MPI_Bcast sends the same data to every process, while MPI_Scatter sends different chunks of the data to different processes

MPI_Scatter(
    void* send_data,
    int send_count,
    MPI_Datatype send_datatype,
    void* recv_data,
    int recv_count,
    MPI_Datatype recv_datatype,
    int root,
    MPI_Comm communicator)
  • MPI_Gather
MPI_Gather(
    void* send_data,
    int send_count,
    MPI_Datatype send_datatype,
    void* recv_data,
    int recv_count,
    MPI_Datatype recv_datatype,
    int root,
    MPI_Comm communicator)

Only the root needs a receive buffer; the other processes can simply pass NULL for recv_data.

The recv_count parameter is the number of elements received from each process, not the total.
- Example code:

float *rand_nums = NULL;
if (world_rank == 0) {
  rand_nums = create_rand_nums(elements_per_proc * world_size);
}

// Create a buffer that will hold a subset of the random numbers
float *sub_rand_nums = malloc(sizeof(float) * elements_per_proc);

// Scatter the random numbers to all processes
MPI_Scatter(rand_nums, elements_per_proc, MPI_FLOAT, sub_rand_nums,
            elements_per_proc, MPI_FLOAT, 0, MPI_COMM_WORLD);

// Compute the average of your subset
float sub_avg = compute_avg(sub_rand_nums, elements_per_proc);
// Gather all partial averages down to the root process
float *sub_avgs = NULL;
if (world_rank == 0) {
  sub_avgs = malloc(sizeof(float) * world_size);
}
MPI_Gather(&sub_avg, 1, MPI_FLOAT, sub_avgs, 1, MPI_FLOAT, 0,
           MPI_COMM_WORLD);

// Compute the total average of all numbers.
if (world_rank == 0) {
  float avg = compute_avg(sub_avgs, world_size);
}
  • many-to-many communication pattern
  • MPI_Allgather

MPI_Allgather(
    void* send_data,
    int send_count,
    MPI_Datatype send_datatype,
    void* recv_data,
    int recv_count,
    MPI_Datatype recv_datatype,
    MPI_Comm communicator)

A similar routine using MPI_Allgather:

// Gather all partial averages down to all the processes
float *sub_avgs = (float *)malloc(sizeof(float) * world_size);
MPI_Allgather(&sub_avg, 1, MPI_FLOAT, sub_avgs, 1, MPI_FLOAT,
              MPI_COMM_WORLD);

// Compute the total average of all numbers.
float avg = compute_avg(sub_avgs, world_size);
  • Function to get the size of an MPI datatype:
 MPI_Type_size(datatype, &datatype_size); // the second argument receives the size in bytes
  • reduce takes a value (or array) from every process and combines them into a smaller set of results, typically a single value, using an operation; see the sketch after the operator list below
MPI_Reduce(
    void* send_data,
    void* recv_data,      // note: recv_data must be able to hold sizeof(datatype) * count bytes
    int count,  
    MPI_Datatype datatype,
    MPI_Op op,
    int root,
    MPI_Comm communicator)


Reduce operators:
- MPI_MAX - Returns the maximum element.
- MPI_MIN - Returns the minimum element.
- MPI_SUM - Sums the elements.
- MPI_PROD - Multiplies all elements.
- MPI_LAND - Performs a logical and across the elements.
- MPI_LOR - Performs a logical or across the elements.
- MPI_BAND - Performs a bitwise and across the bits of the elements.
- MPI_BOR - Performs a bitwise or across the bits of the elements.
- MPI_MAXLOC - Returns the maximum value and the rank of the process that owns it.
- MPI_MINLOC - Returns the minimum value and the rank of the process that owns it.
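A small sketch that sums one float per process onto the root (local_sum and global_sum are illustrative names):

float local_sum = (float)world_rank;   // any per-process value
float global_sum = 0.0f;

// combine every rank's local_sum into global_sum on rank 0 using MPI_SUM
MPI_Reduce(&local_sum, &global_sum, 1, MPI_FLOAT, MPI_SUM, 0, MPI_COMM_WORLD);

if (world_rank == 0) {
    printf("Total sum = %f\n", global_sum);
}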

  • MPI_Allreduce is the same call minus the root parameter: every process receives the result
MPI_Allreduce(
    void* send_data,
    void* recv_data,
    int count,
    MPI_Datatype datatype,
    MPI_Op op,
    MPI_Comm communicator)
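The same sum, but delivered to every rank (a sketch continuing the variables above):

float global_sum = 0.0f;
MPI_Allreduce(&local_sum, &global_sum, 1, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);
// every rank now holds the same global_sum; no root is involved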

To summarize, the common collectives covered so far:
  • [x] MPI_Bcast
  • [x] MPI_Scatter
  • [x] MPI_Gather
  • [x] MPI_Reduce

Splitting the global communicator


The function needed:

MPI_Comm_split(
    MPI_Comm comm,      // the communicator to split
    int color,          // processes with the same color end up in the same new communicator
    int key,            // determines the rank ordering within the new communicator
    MPI_Comm* newcomm)  // the resulting new communicator
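A sketch that splits MPI_COMM_WORLD into sub-communicators of four consecutive ranks (row_comm and the divisor 4 are illustrative choices):

int world_rank, world_size;
MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
MPI_Comm_size(MPI_COMM_WORLD, &world_size);

// ranks 0-3 share color 0, ranks 4-7 share color 1, and so on;
// within each new communicator the key (world_rank) preserves the ordering
int color = world_rank / 4;
MPI_Comm row_comm;
MPI_Comm_split(MPI_COMM_WORLD, color, world_rank, &row_comm);

int row_rank, row_size;
MPI_Comm_rank(row_comm, &row_rank);
MPI_Comm_size(row_comm, &row_size);

MPI_Comm_free(&row_comm);   // release the sub-communicator when finished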

A communicator is made up of a context (ID) and a group (the set of its processes).

MPI_Comm_group(
    MPI_Comm comm,
    MPI_Group* group)

Union of two groups:

MPI_Group_union(
    MPI_Group group1,
    MPI_Group group2,
    MPI_Group* newgroup)

Intersection of two groups:

MPI_Group_intersection(
    MPI_Group group1,
    MPI_Group group2,
    MPI_Group* newgroup)

Extract a new group from an array of ranks:

MPI_Group_incl(
    MPI_Group group,
    int n,
    const int ranks[],
    MPI_Group* newgroup)

Create a communicator from a group:

MPI_Comm_create_group(
    MPI_Comm comm,
    MPI_Group group,
    int tag,
    MPI_Comm* newcomm)

Example:

// prime_group is assumed to have been built beforehand, e.g. by getting the
// world group with MPI_Comm_group(MPI_COMM_WORLD, &world_group) and then
// selecting ranks with MPI_Group_incl
// Create a new communicator based on the group
MPI_Comm prime_comm;
MPI_Comm_create_group(MPI_COMM_WORLD, prime_group, 0, &prime_comm);

int prime_rank = -1, prime_size = -1;
// If this rank isn't in the new communicator, it will be
// MPI_COMM_NULL. Using MPI_COMM_NULL for MPI_Comm_rank or
// MPI_Comm_size is erroneous
if (MPI_COMM_NULL != prime_comm) {
    MPI_Comm_rank(prime_comm, &prime_rank);
    MPI_Comm_size(prime_comm, &prime_size);
}

Checking for MPI_COMM_NULL matters: it tells you whether this process is a member of the new communicator.
