Background before the problem: the machine is an IBM blade cluster with 6 nodes; each node has 2 processors and each processor has 4 cores. Each node has 12 GB of RAM. The OS is Red Hat Enterprise Linux 5 (64-bit), and the installed software is the latest Intel Fortran compiler (l_cprof_p_11.1.059_intel64.tgz, 64-bit) and MPICH2 (mpich2-1.0.tar.gz, 64-bit).
Almost all of the arrays in the program are allocated dynamically; only very small arrays are not. The same program, when computing 2 grid blocks (a parameter in the program set to 2, then mpiexec -n 3 ./debugger), gives correct results. When computing 38 grid blocks (the parameter set to 38, then mpiexec -n 39 ./debugger), it fails with the errors shown below.
(The root process only gathers the computed data and does not take part in the computation, so the program needs one more process than the number of grid blocks; a minimal sketch of this layout follows.)
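To make that layout concrete, here is a minimal sketch of the rank-0-as-collector pattern just described. The program name, the variable names, and the use of plain MPI_Send/MPI_Recv are my own illustration, not the actual code of MPIscch.f90:

program collector_sketch
    use mpi
    implicit none
    integer :: ierr, rank, nprocs, src
    real(8) :: block_result, total
    call MPI_Init(ierr)
    call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
    call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)
    if (rank == 0) then
        ! root only gathers: one message from each of the nprocs-1 workers
        total = 0.0d0
        do src = 1, nprocs - 1
            call MPI_Recv(block_result, 1, MPI_DOUBLE_PRECISION, src, 0, &
                          MPI_COMM_WORLD, MPI_STATUS_IGNORE, ierr)
            total = total + block_result
        end do
    else
        ! each worker computes one grid block and sends its result to rank 0
        block_result = real(rank, 8)   ! placeholder for the real computation
        call MPI_Send(block_result, 1, MPI_DOUBLE_PRECISION, 0, 0, &
                      MPI_COMM_WORLD, ierr)
    end if
    call MPI_Finalize(ierr)
end program collector_sketch

With 38 grid blocks this gives 38 workers plus the collector, which matches mpiexec -n 39.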
(To rule out stack overflow, running out of memory, and similar causes, the following limits have already been set:
data seg size: ulimit -d unlimited
max memory size: ulimit -m unlimited
stack size: ulimit -s unlimited
CPU time: ulimit -t unlimited
virtual memory: ulimit -v unlimited)
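One check that complements the ulimit settings above: since nearly every array is allocated dynamically, a failed allocation can be caught explicitly with stat= instead of surfacing later as a SIGSEGV. A minimal, self-contained sketch; the array name grid and the sizes ni/nj are hypothetical, not taken from MPIscch.f90:

program alloc_check
    use mpi
    implicit none
    integer :: ierr, istat
    integer, parameter :: ni = 1000, nj = 1000   ! hypothetical block size
    real(8), allocatable :: grid(:,:)
    call MPI_Init(ierr)
    ! stat= turns a failed allocation into a testable condition
    ! instead of a crash at first use
    allocate(grid(ni, nj), stat=istat)
    if (istat /= 0) then
        print *, 'allocation of grid failed, stat =', istat
        call MPI_Abort(MPI_COMM_WORLD, 1, ierr)
    end if
    grid = 0.0d0
    deallocate(grid)
    call MPI_Finalize(ierr)
end program alloc_check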
[test@mnode2 MrLuzhiliang]$ mpif90 -o debugger MPIscch.f90
[test@mnode2 MrLuzhiliang]$ mpiexec -n 39 ./debugger
proccess 1 :now the loop is starting......
proccess 2 :now the loop is starting......
proccess 3 :now the loop is starting......
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image PC Routine Line Source
debugger 000000000044B69F Unknown Unknown Unknown
debugger 0000000000414827 Unknown Unknown Unknown
debugger 0000000000403B0C Unknown Unknown Unknown
libc.so.6 000000355761D974 Unknown Unknown Unknown
debugger 0000000000403A19 Unknown Unknown Unknown
aborting job:
Fatal error in MPI_Allreduce: Other MPI error, error stack:
MPI_Allreduce(1058): MPI_Allreduce(sbuf=0x7fff4a23f234, rbuf=0x7fff4a23f238, count=1, MPI_INTEGER, MPI_SUM,
MPI_COMM_WORLD) failed
MPIR_Allreduce(545):
MPIC_Recv(98):
MPIC_Wait(308):
MPIDI_CH3_Progress_wait(207): an error occurred while handling an event returned by MPIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(492):
connection_recv_fail(1728):
MPIDU_Socki_handle_read(590): connection closed by peer (set=0,sock=3)
aborting job:
Fatal error in MPI_Sendrecv: Internal MPI error!, error stack:
MPI_Sendrecv(207): MPI_Sendrecv(sbuf=0xaeff570, scount=1152, dtype=0x4c000430, dest=9, stag=10, rbuf=0xaf01ea0,
rcount=1152, dtype=0x4c000430, src=9, rtag=10, MPI_COMM_WORLD, status=0x7cd300) failed
(unknown)(): Internal MPI error!
rank 26 in job 10 mnode2_54617 caused collective abort of all ranks
exit status of rank 26: return code 13
rank 3 in job 10 mnode2_54617 caused collective abort of all ranks
exit status of rank 3: return code 174
rank 0 in job 10 mnode2_54617 caused collective abort of all ranks
exit status of rank 0: return code 13