Dear all,
maybe some of you remember my first topic with this issue https://software.intel.com/en-us/forums/intel-visual-fortran-compiler-fo.... I am working with a code written in Fortran 77 and I was able to introduce parallelism using OpenMP. However, the
overhead effects are very high, especially when the Intel Fortran Compiler is used. For example, for a Intel Core i7 with 8 Threads I have the following execution times (in seconds):
Serial code: 811.3
ifort OpenMP, 1 thread: 864.4
ifort OpenMP, 8 threads: 169.1
So, you can see that the overhead introduced to the program when OpenMP is enabled is around 6.5% of the serial time, that seems excessive.
After some time and following the advices given in the first post I decided to give a second try to this issue. I profiled my program using Amplifier XE and the report shows a couple of subroutines that share most of the overhead time. For example, consider the following subroutine:
REAL*8 FUNCTION spin_rejection(qel,medium,elke,beta2,q1,cost, spin *_index,is_single) IMPLICIT NONE REAL*8 elke,beta2,q1,cost INTEGER*4 qel,medium LOGICAL spin_index,is_single COMMON/SPIN_DATA/ spin_rej(1,0:1,0: 31,0:15,0:31), espin_min,espin *_max,espml,b2spin_min,b2spin_max, dbeta2,dbeta2i,dlener,dleneri,dq *q1,dqq1i, fool_intel_optimizer REAL*4 spin_rej,espin_min,espin_max,espml,b2spin_min,b2spin_max, d *beta2,dbeta2i,dlener,dleneri,dqq1,dqq1i LOGICAL fool_intel_optimizer COMMON/RANDOMM/ rng_array(128), urndm(97), crndm, cdrndm, cmrndm, *i4opt, ixx, jxx, fool_optimizer, twom24, rng_seed C$OMP0THREADPRIVATE(/RANDOMM/) INTEGER*4 urndm, crndm, cdrndm, cmrndm, i4opt, ixx, jxx, fool_opti *mizer,rng_seed,rng_array REAL*4 twom24 REAL*8 rnno,ai,qq1,aj,xi,ak INTEGER*4 i,j,k SAVE i,j C$OMP0THREADPRIVATE(i,j) IF (( spin_index )) THEN spin_index = .false. IF (( beta2 .GE. b2spin_min )) THEN ai = (beta2 - b2spin_min)*dbeta2i i = ai ai = ai - i i = i + 15 + 1 ELSE IF(( elke .GT. espml )) THEN ai = (elke - espml)*dleneri i = ai ai = ai - i ELSE i = 0 ai = -1 END IF IF((rng_seed .GT. 128))CALL ranmar_get rnno = rng_array(rng_seed)*twom24 rng_seed = rng_seed + 1 IF((rnno .LT. ai))i = i + 1 IF (( is_single )) THEN j = 0 ELSE qq1 = 2*q1 qq1 = qq1/(1 + qq1) aj = qq1*dqq1i j = aj IF (( j .GE. 15 )) THEN j = 15 ELSE aj = aj - j IF((rng_seed .GT. 128))CALL ranmar_get rnno = rng_array(rng_seed)*twom24 rng_seed = rng_seed + 1 IF((rnno .LT. aj))j = j + 1 END IF END IF END IF xi = SQRT(0.5*(1-cost)) ak = xi*31 k = ak ak = ak - k spin_rejection = (1-ak)*spin_rej(medium,qel,i,j,k) + ak*spin_rej(m *edium,qel,i,j,k+1) RETURN END
The analysis shows that a lot of overhead occurs when the ranmar_get subroutine is called. It corresponds to:
SUBROUTINE ranmar_get IMPLICIT NONE COMMON/RANDOMM/ rng_array(128), urndm(97), crndm, cdrndm, cmrndm, *i4opt, ixx, jxx, fool_optimizer, twom24, rng_seed C$OMP0THREADPRIVATE(/RANDOMM/) INTEGER*4 urndm, crndm, cdrndm, cmrndm, i4opt, ixx, jxx, fool_opti *mizer,rng_seed,rng_array REAL*4 twom24 INTEGER*4 i,iopt IF((rng_seed .EQ. 999999))CALL init_ranmar DO 2511 i=1,128 iopt = urndm(ixx) - urndm(jxx) IF((iopt .LT. 0))iopt = iopt + 16777216 urndm(ixx) = iopt ixx = ixx - 1 jxx = jxx - 1 IF ((ixx .EQ. 0)) THEN ixx = 97 ELSE IF(( jxx .EQ. 0 )) THEN jxx = 97 END IF crndm = crndm - cdrndm IF((crndm .LT. 0))crndm = crndm + cmrndm iopt = iopt - crndm IF((iopt .LT. 0))iopt = iopt + 16777216 rng_array(i) = iopt 2511 CONTINUE 2512 CONTINUE rng_seed = 1 RETURN END
Well, I decided to give a look into the assembly of the spin_rejection function and, for example, at the line when ranmar_get is first called I obtained the following:
0x4202fd Block 196:
0x4202f8 9,019 callq 0x401bc0 <for_write_seq_lis_xmit>
0x4202f0 9,019 lea 0x340(%rsp), %rdx
0x4202e9 9,019 movl %r13d, 0x2d0(%rdi)
0x4202e4 9,019 lea 0x70(%rsp), %rdi
0x4202df 9,019 mov $0x480b14, %esi
0x4202df Block 195:
0x4202da 9,019 callq 0x401bc0 <for_write_seq_lis_xmit>
0x4202d2 9,019 lea 0x320(%rsp), %rdx
0x4202cb 9,019 movl %r15d, 0x2b0(%rdi)
0x4202c6 9,019 lea 0x70(%rsp), %rdi
0x4202c1 9,019 mov $0x480b0c, %esi
[...]
that pattern repeats through 500 lines of the assembly code... At the exit of the spin_rejection function I have several calls to the functions __kmpc_threadprivate_cached and __kmpc_global_thread_num, with a huge overhead time for the latter.
For some reason and remembering that someone points out that maybe is something related to bounds checking of arrays so I decided to reformat the RNG to avoid any use of arrays. Essentially now the generator is called each time a random number is needed and not each a certain number of expended random numbers. So now I have the following version of the RNG:
SUBROUTINE ranmar_get IMPLICIT NONE COMMON/RANDOMM/ rng_array, urndm(97), crndm, cdrndm, cmrndm, i4opt *, ixx, jxx, fool_optimizer, twom24, rng_seed C$OMP0THREADPRIVATE(/RANDOMM/) INTEGER*4 urndm, crndm, cdrndm, cmrndm, i4opt, ixx, jxx, fool_opti *mizer,rng_seed,rng_array REAL*4 twom24 INTEGER*4 iopt iopt = urndm(ixx) - urndm(jxx) IF((iopt .LT. 0))iopt = iopt + 16777216 urndm(ixx) = iopt ixx = ixx - 1 jxx = jxx - 1 IF ((ixx .EQ. 0)) THEN ixx = 97 ELSE IF(( jxx .EQ. 0 )) THEN jxx = 97 END IF crndm = crndm - cdrndm IF((crndm .LT. 0))crndm = crndm + cmrndm iopt = iopt - crndm IF((iopt .LT. 0))iopt = iopt + 16777216 rng_array = iopt RETURN END
the main change is that now rng_array is a single integer variable and not an array of a certain number of elements. So now the calling sequence in spin_rejection is simple CALL ranmar_get and I obtain the following assembly code for that section:
0x417ceb 8,959 sub %eax, %r8d 0x417ceb Block 25: 0x417ce4 8,959 addl 0x190(%r15), %eax 0x417ce4 Block 24: 0x417ce2 8,959 jns 0x417ceb <Block 25> 0x417ce0 8,959 sub %ebp, %eax 0x417cd9 8,959 movl 0x188(%r15), %eax 0x417cd2 8,959 movl 0x18c(%r15), %ebp 0x417cd2 Block 23: 0x417ccf 8,959 cmovz %eax, %ecx 0x417ccd 8,959 test %ecx, %ecx 0x417cc8 8,959 mov $0x61, %eax 0x417cc8 Block 22: 0x417cc6 8,959 jmp 0x417cd2 <Block 23> 0x417cc1 8,959 mov $0x61, %edx 0x417cc1 Block 21: 0x417cbf 8,959 jnz 0x417cc8 <Block 22> 0x417cbd 8,959 dec %edx 0x417cb9 8,959 movl %r8d, (%r15,%rdx,4) 0x417cb5 8,959 cmovs %eax, %r8d 0x417cae 8,959 lea 0x1000000(%r8), %eax 0x417cab 8,959 test %r8d, %r8d 0x417ca9 8,959 dec %ecx 0x417ca5 8,959 subl (%r15,%rcx,4), %r8d 0x417ca1 8,959 movl (%r15,%rdx,4), %r8d 0x417c9e 8,959 movsxd %ecx, %rcx 0x417c97 8,959 movl 0x19c(%r15), %ecx 0x417c94 8,959 movsxd %edx, %rdx 0x417c8d 8,959 movl 0x198(%r15), %edx
that is the entire assembly code for the first call of ranmar_get. Well, it is much simpler that the original version. The overhead cost of the call to ranmar_get inside spin_rejection also decreases heavily. __kmpc_threadprivate_cached and __kmpc_global_thread_num are also called at the end of the spin_rejection subroutine, but the overhead cost is much lower. I have starting believing that the use of arrays in the RNG is the source, in some way, of the overhead when OpenMP is enabled within the ifort compiler. Well, I have even tested other RNG with much simpler structures and I do not have any overhead cost pointed out by Amplifier XE. Even __kmpc_threadprivate_cached and __kmpc_global_thread_num are not called at all at the end of the spin_rejection subroutine.
I would like to know if something related with arrays could be the source of the overhead costs in my program, and if there is a way to alleviate this problem. I am not a programming expert so it is quite difficult to diagnose this problem and it limits the scalability of my program.
Thanks for your help!