Hi
I have a code that works on a cluster when I use 6^3 = 216 cores, but it crashes when I try to run it at a higher resolution on 12^3 = 1728 cores (all the parameters are the same except the grid spacing and the number of processors the code runs on).
We checked whether it is a memory issue, but even running the job with 16 tasks per node (108 nodes) didn't help.
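For reference, the layout of the failing job can be checked with a few lines of arithmetic. This is only a sketch: the variable names are illustrative, and it assumes the 12^3 figure means 12 ranks along each dimension of a cubic decomposition.

```python
# Job layout for the failing run (numbers from the post; names are illustrative).
ranks_per_dim = 12      # assumption: 12 ranks along each of 3 dimensions
tasks_per_node = 16     # tasks per node used in the memory test

total_ranks = ranks_per_dim ** 3
nodes, leftover = divmod(total_ranks, tasks_per_node)

print(f"total MPI ranks: {total_ranks}")
print(f"nodes at {tasks_per_node} tasks per node: {nodes} (leftover ranks: {leftover})")
print(f"ratio to the working 6^3 run: {total_ranks // 6 ** 3}x")
```

The 1728 ranks fill exactly 108 nodes at 16 tasks per node, which matches the node count quoted above, and the failing run uses 8 times as many ranks as the working one.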
I cannot debug the program with something like TotalView because of the limit on the number of processes these debuggers can manage.
I tried compiling the program with -O0 -g -traceback to get better information in the error message.
With these options the job still fails, but it keeps running until the time I requested on the cluster expires.
In this case I get:
srun.slurm: Job step aborted: Waiting up to 2 seconds for job step to finish.
slurmstepd-borgt091: *** JOB 5787356 CANCELLED AT 2015-11-02T11:17:00 DUE TO TIME LIMIT on borgt091 ***
slurmstepd-borgt091: *** STEP 5787356.0 CANCELLED AT 2015-11-02T11:17:00 DUE TO TIME LIMIT on borgt091 ***
forrtl: error (78): process killed (SIGTERM)
Image PC Routine Line Source
3dpic_full_mpi.ex 000000000088C169 Unknown Unknown Unknown
3dpic_full_mpi.ex 000000000088AA3E Unknown Unknown Unknown
3dpic_full_mpi.ex 0000000000848F32 Unknown Unknown Unknown
3dpic_full_mpi.ex 0000000000815663 Unknown Unknown Unknown
3dpic_full_mpi.ex 0000000000819219 Unknown Unknown Unknown
libpthread.so.0 00002AAAAB669810 Unknown Unknown Unknown
libpthread.so.0 00002AAAAB6663D0 Unknown Unknown Unknown
forrtl: error (78): process killed (SIGTERM)
Image PC Routine Line Source
3dpic_full_mpi.ex 000000000088C169 Unknown Unknown Unknown
3dpic_full_mpi.ex 000000000088AA3E Unknown Unknown Unknown
3dpic_full_mpi.ex 0000000000848F32 Unknown Unknown Unknown
3dpic_full_mpi.ex 0000000000815663 Unknown Unknown Unknown
3dpic_full_mpi.ex 0000000000819219 Unknown Unknown Unknown
libpthread.so.0 00002AAAAB669810 Unknown Unknown Unknown
3dpic_full_mpi.ex 0000000000819140 Unknown Unknown Unknown
libpthread.so.0 00002AAAAB669810 Unknown Unknown Unknown
libmlx5-rdmav2.so 00002AAAACE3F4BB Unknown Unknown Unknown
Stack trace terminated abnormally.
(more similar lines...)
I have attached the complete error file (JOBID 5787356).
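Since the Routine, Line and Source columns are all Unknown, the PC values are the only usable information in these traces. Below is a minimal sketch of how the addresses belonging to the executable could be collected from an error file so that they can be handed to a symbol lookup tool. The function name `executable_addresses` and the `image_prefix` parameter are hypothetical, and the sample text is just a few frames copied from the trace above.

```python
import re

# Matches Intel Fortran traceback frames such as
# "3dpic_full_mpi.ex 000000000088C169 Unknown Unknown Unknown"
FRAME = re.compile(r"^(\S+)\s+([0-9A-Fa-f]{16})\s")

def executable_addresses(log_text, image_prefix="3dpic_full_mpi"):
    """Return unique PC addresses of frames in the given image, in first-seen order."""
    seen = []
    for line in log_text.splitlines():
        m = FRAME.match(line.strip())
        if m and m.group(1).startswith(image_prefix):
            # Drop leading zeros; keep a single "0" if the address is all zeros.
            addr = "0x" + (m.group(2).lstrip("0") or "0")
            if addr not in seen:
                seen.append(addr)
    return seen

# A few frames copied from the trace above (shared-library frames are skipped).
sample = """\
Image PC Routine Line Source
3dpic_full_mpi.ex 000000000088C169 Unknown Unknown Unknown
3dpic_full_mpi.ex 000000000088AA3E Unknown Unknown Unknown
libpthread.so.0 00002AAAAB669810 Unknown Unknown Unknown
3dpic_full_mpi.ex 000000000088C169 Unknown Unknown Unknown
"""

addresses = executable_addresses(sample)
print(" ".join(addresses))
```

Assuming GNU binutils is available on the cluster, the resulting list could then be passed to something like `addr2line -f -e 3dpic_full_mpi.exe <addresses>`. That is only meaningful if the binary was built with -g and is the same build that produced the trace. Addresses inside shared libraries are runtime addresses and cannot be looked up this way directly.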
However, when I run the same simulation without these compiler options, I get a different error and the job fails earlier:
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image PC Routine Line Source
3dpic_full_mpi.ex 0000000000869189 Unknown Unknown Unknown
3dpic_full_mpi.ex 0000000000867A5E Unknown Unknown Unknown
3dpic_full_mpi.ex 0000000000825B72 Unknown Unknown Unknown
3dpic_full_mpi.ex 00000000007F2633 Unknown Unknown Unknown
3dpic_full_mpi.ex 00000000007F621B Unknown Unknown Unknown
libpthread.so.0 00002AAAAB669810 Unknown Unknown Unknown
libc.so.6 00002AAAAC126C52 Unknown Unknown Unknown
3dpic_full_mpi.ex 00000000005389A2 Unknown Unknown Unknown
3dpic_full_mpi.ex 00000000004A6643 Unknown Unknown Unknown
3dpic_full_mpi.ex 0000000000462106 Unknown Unknown Unknown
3dpic_full_mpi.ex 000000000041B72F Unknown Unknown Unknown
3dpic_full_mpi.ex 00000000004165C6 Unknown Unknown Unknown
libc.so.6 00002AAAAC02FC36 Unknown Unknown Unknown
3dpic_full_mpi.ex 00000000004164B9 Unknown Unknown Unknown
srun.slurm: error: borgo015: task 0: Exited with exit code 174
MPT ERROR: borgo021 has had continuous IB fabric problems for 10
(MPI_WATCHDOG_TIMER) minutes trying to reach borgo015. Aborting.
MPT ERROR: borgo020 has had continuous IB fabric problems for 10
(MPI_WATCHDOG_TIMER) minutes trying to reach borgo015. Aborting.
MPT: Global rank 32 is aborting with error code 0.
Process ID: 12240, Host: borgo021, Program: /gpfsm/dnb32/gbrambil/Kcode/pulsarSILOF/3dpic_full_mpi.exe
(more output follows)
I have attached the error file for this job too (JOBID 5991137).
Do you have any idea what the problem could be? I saw this topic https://software.intel.com/en-us/forums/intel-fortran-compiler-for-linux... Does it apply to my case too (unlike the poster there, I cannot use a debugger)?
P.S.: the error file contains the line rm: cannot remove `pcrimth.dat': No such file or directory. Don't worry about it; it always appears, but the code still runs.
Thanks