Dear All,
There are various threads here, as well as in other forums, dealing with segmentation faults related to the stack size and OpenMP. Unfortunately, the information is (at least to my understanding) not consistent, and because some threads are rather old, it might also be outdated. My primary source is https://software.intel.com/en-us/articles/determining-root-cause-of-sigsegv-or-sigbus-errors/, where the recommendation for OpenMP applications is to remove the operating system's stack size limit via
ulimit -s unlimited
or to set it to a high value, rather than giving the compiler a threshold (via -heap-arrays) for the size of arrays it may put on the stack (i.e. all arrays known at compile time to be larger are put on the heap). Unfortunately, the other thread linked in the aforementioned article (https://software.intel.com/en-us/articles/intel-fortran-compiler-increased-stack-usage-of-80-or-higher-compilers-causes-segmentation-fault) recommends using "-heap-arrays" regardless of whether the application uses OpenMP or not. Also, https://software.intel.com/en-us/forums/topic/501500#comment-1779157 recommends against setting the stack size to unlimited.
I tried setting the stack size to a larger value (the default on Ubuntu is 8 MiB), but this sometimes causes problems with other software that uses threads. It is also hard for me to estimate the required stack size.
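For completeness, this is roughly how I inspect and change the limits (the 64 MiB value below is an arbitrary example, not a recommendation; as far as I understand, ulimit -s only governs the initial thread, while the OpenMP worker threads take their stack size from OMP_STACKSIZE, or KMP_STACKSIZE with the Intel runtime):

```shell
# Show the current soft stack limit of this shell (reported in KiB).
ulimit -s

# Raise the limit for this shell session only; the value is in KiB,
# so 65536 means 64 MiB. This affects the initial (main) thread.
ulimit -s 65536

# Worker threads created by the OpenMP runtime take their stack size
# from OMP_STACKSIZE (the Intel runtime also honours KMP_STACKSIZE).
export OMP_STACKSIZE=64M
```

Note that ulimit changes only apply to the current shell and its children, which is exactly why they can interfere with other threaded software started from the same environment.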
A critical part of my code looks like
function mesh_build_cellnodes(nodes,Ncellnodes)

 implicit none
 integer(pInt), intent(in) :: Ncellnodes
 real(pReal), dimension(3,mesh_Nnodes), intent(in) :: nodes
 real(pReal), dimension(3,Ncellnodes) :: mesh_build_cellnodes

 integer(pInt) :: &
   e,t,n,m, &
   localCellnodeID
 real(pReal), dimension(3) :: &
   myCoords

 mesh_build_cellnodes = 0.0_pReal

!$OMP PARALLEL DO PRIVATE(e,localCellnodeID,t,myCoords)
 do n = 1_pInt,Ncellnodes
   e = mesh_cellnodeParent(1,n)
   localCellnodeID = mesh_cellnodeParent(2,n)
   t = mesh_element(2,e)
   myCoords = 0.0_pReal
   do m = 1_pInt,FE_Nnodes(t)
     myCoords = myCoords + nodes(1:3,mesh_element(4_pInt+m,e)) &
                         * FE_cellnodeParentnodeWeights(m,localCellnodeID,t)
   enddo
   mesh_build_cellnodes(1:3,n) = myCoords / sum(FE_cellnodeParentnodeWeights(:,localCellnodeID,t))
 enddo
!$OMP END PARALLEL DO

end function mesh_build_cellnodes
where "mesh_Nnodes" can range from about 1000 to a few million. According to https://software.intel.com/en-us/forums/topic/301590#comment-1524955, I understand that the "-heap-arrays" option will force the compiler to allocate "nodes" on the heap rather than the stack, regardless of the threshold value given, because its size is not known at compile time. A possible solution is to pass
-heap-arrays 8
as a compiler option, since a threshold of 8 KiB seems reasonable given the stack size. In fact, the default on Ubuntu is
ulimit -s 8192
My application runs fine with that (but crashes when I do not use the compiler option). The other possibility is still to remove the stack size limit (and omit the compiler option).
Since removing the stack size limit (and omitting the -heap-arrays option) significantly improves performance at slightly higher memory consumption (according to the time command; see the attached file), I would rather use this option.
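For reference, the comparison was along these lines (the binary names below are placeholders, not my actual executable; /usr/bin/time with GNU time's -v flag reports the "Maximum resident set size" alongside the elapsed time, which makes the memory trade-off visible):

```shell
# Variant built with "-heap-arrays 8": runs within the default 8 MiB stack.
/usr/bin/time -v ./app_heaparrays

# Variant built without "-heap-arrays": remove the stack limit first,
# otherwise it segfaults on large meshes.
ulimit -s unlimited
/usr/bin/time -v ./app_stack
```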
Therefore, my question is whether this approach has any disadvantages for an application running with 1 to 32 threads.
Thanks in advance for your contributions and apologies for raising this question again.