MPI Debugging: A Collection of Error Messages

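If you are writing in Fortran, add implicit none to every program unit, especially in a large codebase. It lets the compiler catch many problems, such as misspelled variable names, at compile time.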
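As a small illustration of what implicit none buys you, here is a minimal sketch in free-form Fortran. The program and the variable names (`nsteps`, and the misspelling `nstep`) are made up for this example and are not taken from my code.

```fortran
program implicit_demo
  implicit none              ! every variable must now be declared explicitly
  integer :: nsteps

  nsteps = 10

  ! Suppose the next line were mistyped as:  nstep = nsteps + 1
  ! With implicit none the compiler rejects it, because "nstep" has no
  ! declared type. Without implicit none, "nstep" would silently become a
  ! new implicitly typed INTEGER and nsteps would never be updated.
  nsteps = nsteps + 1

  print *, 'nsteps =', nsteps
end program implicit_demo
```

Uncommenting the misspelled assignment is an easy way to see the compile-time error for yourself.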
1.

  • [root@c0108 parallel]# mpiexec -n 5 ./simple
  • aborting job:
  • Fatal error in MPI_Irecv: Invalid rank, error stack:
  • MPI_Irecv(143): MPI_Irecv(buf=0x25dab60, count=0, MPI_DOUBLE_PRECISION, src=5, tag=99, MPI_COMM_WORLD, request=0x7fffa02ca86c) failed
  • MPI_Irecv(95): Invalid rank has value 5 but must be nonnegative and less than 5
  • rank 4 in job 5  c0108_52041   caused collective abort of all ranks
  •   exit status of rank 4: return code 13


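The message says that rank 5 is invalid: mpiexec -n 5 ./simple starts five processes, with ranks 0 1 2 3 4. The log shows rank 4 posting a receive with src=5, so the problem is in the code itself. It is not necessarily the rank expression that is wrong, though; an argument that was not passed through correctly can produce the same symptom, and MPI errors are often hard to read at first.
My code has only a few MPI_Irecv calls, so I added print statements to locate the offending line, shown below: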

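! Debug marker printed just before the suspect call
      print *, myid+1,'111111111111111111'

! The source argument is MYID+1: on the last rank this equals the
! number of processes, which is not a valid rank.
      call MPI_Irecv(P(1,1,location),IMAX*JMAX*MIN(ITSP, ke-myke),
     &MPI_DOUBLE_PRECISION,MYID+1,RELY,MPI_COMM_WORLD,REQ,IERR)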
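One common way to avoid an out-of-range neighbor rank is to replace it with MPI_PROC_NULL on the boundary processes. The sketch below is not my original code: the buffer size `n`, the tag value, and the names `left`, `right`, `sendbuf` and `recvbuf` are assumptions chosen for illustration. It is written in free-form Fortran with `use mpi`; older codes use `include 'mpif.h'` instead.

```fortran
program neighbor_guard
  use mpi
  implicit none
  integer, parameter :: n = 4, tag = 99    ! hypothetical message size and tag
  integer :: myid, nprocs, left, right, ierr
  integer :: reqs(2)
  double precision :: sendbuf(n), recvbuf(n)

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, myid, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

  ! Valid ranks are 0 .. nprocs-1. The first and last ranks lack a neighbor
  ! on one side, so use MPI_PROC_NULL there instead of myid-1 / myid+1.
  left  = myid - 1
  right = myid + 1
  if (left < 0)        left  = MPI_PROC_NULL
  if (right >= nprocs) right = MPI_PROC_NULL

  sendbuf = dble(myid)
  recvbuf = -1.0d0

  ! Receive from the right neighbor and send to the left neighbor.
  ! An operation whose partner is MPI_PROC_NULL completes at once and
  ! leaves the buffer untouched.
  call MPI_Irecv(recvbuf, n, MPI_DOUBLE_PRECISION, right, tag, &
                 MPI_COMM_WORLD, reqs(1), ierr)
  call MPI_Isend(sendbuf, n, MPI_DOUBLE_PRECISION, left, tag, &
                 MPI_COMM_WORLD, reqs(2), ierr)
  call MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE, ierr)

  print *, 'rank', myid, 'received', recvbuf(1)
  call MPI_Finalize(ierr)
end program neighbor_guard
```

Run with mpiexec -n 5, the last rank keeps its initial value of -1 in `recvbuf` because its receive from MPI_PROC_NULL transfers nothing. Whether skipping the exchange is the right behavior at the boundary depends on the algorithm; a periodic domain would wrap the rank around with a modulo instead.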
2.
  • [root@c0109 test]# mpiexec -n 5 ./simple
  • rank 3 in job 22  c0109_51164   caused collective abort of all ranks
  •   exit status of rank 3: killed by signal 11
  • [root@c0109 test]#

Signal 11 is officially known as a "segmentation fault": the program accessed a memory location that was not assigned to it. That is usually a bug in the program.


3.

  • [root@c0108 test]# mpirun -np 4 ./simple
  • aborting job:
  • Fatal error in MPI_Wait: Invalid MPI_Request, error stack:
  • MPI_Wait(139): MPI_Wait(request=0x7fff1f675228, status0x7fff1f675218) failed
  • MPI_Wait(75): Invalid MPI_Request
  • rank 2 in job 24  c0108_52041   caused collective abort of all ranks
  •   exit status of rank 2: return code 13

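Solution:

Generally this happens because MPI_Test or MPI_Wait was given a request that is unknown to MPICH (the request was not the one returned by MPICH when you made the Isend/Irecv/send_init/recv_init call). In other words, the MPI_Irecv does not match its MPI_Wait(req,status,IERR): the request handles got mixed up. If there are many MPI_Wait() calls, you can comment them out one at a time to pin down the faulty one.

Also, in a Fortran program, first check how the status variable is declared: integer req,status(MPI_STATUS_SIZE),ierr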
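The pairing described above can be sketched as follows. This is a minimal illustration and not my original code: the two-rank exchange, the size `n`, and the tag are assumptions. The points to note are that `req` is a plain INTEGER, `status` is an INTEGER array of size MPI_STATUS_SIZE, and each MPI_Wait receives exactly the handle that the preceding nonblocking call filled in.

```fortran
program wait_pairing
  use mpi
  implicit none
  integer, parameter :: n = 4, tag = 99    ! hypothetical message size and tag
  integer :: myid, nprocs, ierr
  integer :: req                           ! request handle: a plain INTEGER
  integer :: status(MPI_STATUS_SIZE)       ! status: an INTEGER array
  double precision :: buf(n)

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, myid, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

  if (nprocs >= 2) then
     if (myid == 0) then
        buf = 1.0d0
        call MPI_Isend(buf, n, MPI_DOUBLE_PRECISION, 1, tag, &
                       MPI_COMM_WORLD, req, ierr)
        ! Wait on the handle that MPI_Isend just returned.
        call MPI_Wait(req, status, ierr)
     else if (myid == 1) then
        call MPI_Irecv(buf, n, MPI_DOUBLE_PRECISION, 0, tag, &
                       MPI_COMM_WORLD, req, ierr)
        ! Wait on the handle that MPI_Irecv just returned.
        call MPI_Wait(req, status, ierr)
        print *, 'rank 1 received', buf(1)
     end if
  end if

  call MPI_Finalize(ierr)
end program wait_pairing
```

Ranks other than 0 and 1 never call MPI_Wait here, because their `req` was never set by a nonblocking call; waiting on such an uninitialized handle is one way to get the "Invalid MPI_Request" error. A `status` declared as a scalar instead of an array is another plausible cause, since MPI_Wait would then write past it and could overwrite neighboring variables such as `req`.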


4.
  • aborting job:
  • Fatal error in MPI_Init: Other MPI error, error stack:
  • MPIR_Init_thread(195): Initialization failed
  • MPID_Init(170): failure during portals initialization
  • MPIDI_Portals_Init(321): progress_init failed
  • MPIDI_PortalsI_Progress_init(653): Out of memory


There is not enough memory on the nodes for the program plus MPI buffers to fit.


You can decrease the amount of memory that MPI uses for buffers by setting the MPICH_UNEX_BUFFER_SIZE environment variable.

Corrections and comments are welcome. Thanks!
