Channel: Intel® Fortran Compiler

Error when running on many cores


Hi

I have a code that works on a cluster when I use 6^3 = 216 cores, but it crashes when I try to run it at a higher resolution using 12^3 = 1728 cores (all the parameters are the same except the grid spacing and the number of processors the code runs on).

We checked whether it is a memory issue, but even running the job with 16 tasks per node (108 nodes) didn't help.
I cannot debug the program with something like TotalView because of the limit on the number of processes these debuggers can manage.

I tried compiling the program with -O0 -g -traceback to get better information in the error message.
With these options the program still fails, but instead of stopping early it keeps running until the time I requested on the cluster expires.

In this case I get:
srun.slurm: Job step aborted: Waiting up to 2 seconds for job step to finish.
slurmstepd-borgt091: *** JOB 5787356 CANCELLED AT 2015-11-02T11:17:00 DUE TO TIME LIMIT on borgt091 ***
slurmstepd-borgt091: *** STEP 5787356.0 CANCELLED AT 2015-11-02T11:17:00 DUE TO TIME LIMIT on borgt091 ***
forrtl: error (78): process killed (SIGTERM)
Image              PC                Routine            Line        Source             
3dpic_full_mpi.ex  000000000088C169  Unknown               Unknown  Unknown
3dpic_full_mpi.ex  000000000088AA3E  Unknown               Unknown  Unknown
3dpic_full_mpi.ex  0000000000848F32  Unknown               Unknown  Unknown
3dpic_full_mpi.ex  0000000000815663  Unknown               Unknown  Unknown
3dpic_full_mpi.ex  0000000000819219  Unknown               Unknown  Unknown
libpthread.so.0    00002AAAAB669810  Unknown               Unknown  Unknown
libpthread.so.0    00002AAAAB6663D0  Unknown               Unknown  Unknown
forrtl: error (78): process killed (SIGTERM)
Image              PC                Routine            Line        Source             
3dpic_full_mpi.ex  000000000088C169  Unknown               Unknown  Unknown
3dpic_full_mpi.ex  000000000088AA3E  Unknown               Unknown  Unknown
3dpic_full_mpi.ex  0000000000848F32  Unknown               Unknown  Unknown
3dpic_full_mpi.ex  0000000000815663  Unknown               Unknown  Unknown
3dpic_full_mpi.ex  0000000000819219  Unknown               Unknown  Unknown
libpthread.so.0    00002AAAAB669810  Unknown               Unknown  Unknown
3dpic_full_mpi.ex  0000000000819140  Unknown               Unknown  Unknown
libpthread.so.0    00002AAAAB669810  Unknown               Unknown  Unknown
libmlx5-rdmav2.so  00002AAAACE3F4BB  Unknown               Unknown  Unknown

Stack trace terminated abnormally.

(more similar lines...)

I attach the complete error file (JOBID 5787356)

However, when I run the same simulation without these compiler options, I get a different error and the job fails earlier:
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line        Source             
3dpic_full_mpi.ex  0000000000869189  Unknown               Unknown  Unknown
3dpic_full_mpi.ex  0000000000867A5E  Unknown               Unknown  Unknown
3dpic_full_mpi.ex  0000000000825B72  Unknown               Unknown  Unknown
3dpic_full_mpi.ex  00000000007F2633  Unknown               Unknown  Unknown
3dpic_full_mpi.ex  00000000007F621B  Unknown               Unknown  Unknown
libpthread.so.0    00002AAAAB669810  Unknown               Unknown  Unknown
libc.so.6          00002AAAAC126C52  Unknown               Unknown  Unknown
3dpic_full_mpi.ex  00000000005389A2  Unknown               Unknown  Unknown
3dpic_full_mpi.ex  00000000004A6643  Unknown               Unknown  Unknown
3dpic_full_mpi.ex  0000000000462106  Unknown               Unknown  Unknown
3dpic_full_mpi.ex  000000000041B72F  Unknown               Unknown  Unknown
3dpic_full_mpi.ex  00000000004165C6  Unknown               Unknown  Unknown
libc.so.6          00002AAAAC02FC36  Unknown               Unknown  Unknown
3dpic_full_mpi.ex  00000000004164B9  Unknown               Unknown  Unknown
srun.slurm: error: borgo015: task 0: Exited with exit code 174
MPT ERROR: borgo021 has had continuous IB fabric problems for 10
    (MPI_WATCHDOG_TIMER) minutes trying to reach borgo015. Aborting.
MPT ERROR: borgo020 has had continuous IB fabric problems for 10
    (MPI_WATCHDOG_TIMER) minutes trying to reach borgo015. Aborting.
MPT: Global rank 32 is aborting with error code 0.
     Process ID: 12240, Host: borgo021, Program: /gpfsm/dnb32/gbrambil/Kcode/pulsarSILOF/3dpic_full_mpi.exe

(other stuff later)

I attach the error file of this job too (JOBID 5991137)
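
Since every frame in the traces above is reported as Unknown, one thing that might help is to collect the PC values that belong to the executable so they can be handed to a symbol resolver. The following is only a minimal sketch in Python: the function name `executable_pcs` and the sample text are illustrative, and it assumes the traceback keeps the column layout shown above (image name followed by a 16-digit hexadecimal PC). The image name is matched by prefix because the trace truncates it to 3dpic_full_mpi.ex.

```python
import re

# One frame line of a forrtl traceback: image name, then a 16-digit hex PC.
# Assumption: the layout matches the traces pasted in this post.
FRAME_RE = re.compile(r"^(\S+)\s+([0-9A-Fa-f]{16})\s")


def executable_pcs(trace_text, image_prefix):
    """Return PCs (as '0x...' strings) of frames whose image starts with image_prefix."""
    pcs = []
    for line in trace_text.splitlines():
        match = FRAME_RE.match(line.strip())
        if match and match.group(1).startswith(image_prefix):
            pcs.append("0x" + match.group(2))
    return pcs


# A few lines copied from the SIGSEGV trace above, used as sample input.
sample = """\
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line        Source
3dpic_full_mpi.ex  0000000000869189  Unknown               Unknown  Unknown
libpthread.so.0    00002AAAAB669810  Unknown               Unknown  Unknown
libc.so.6          00002AAAAC126C52  Unknown               Unknown  Unknown
3dpic_full_mpi.ex  00000000005389A2  Unknown               Unknown  Unknown
"""

pcs = executable_pcs(sample, "3dpic_full_mpi")
print(" ".join(pcs))
```

The printed addresses could then be passed to a tool such as addr2line (for example `addr2line -f -e 3dpic_full_mpi.exe` followed by the addresses), assuming binutils is available on the cluster and the executable was built with -g. Whether this recovers useful routine names for an optimized build is not something I can confirm.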

Do you have any idea what the problem could be? I saw this topic: https://software.intel.com/en-us/forums/intel-fortran-compiler-for-linux... Does it apply to my case too (I cannot use a debugger as that poster did)?

P.S.: the error file contains the line rm: cannot remove `pcrimth.dat': No such file or directory. Don't worry about it; it always appears, but the code runs.

Thanks
