Huge overhead cost with IFORT OpenMP

Dear all,

maybe some of you remember my first topic with this issue https://software.intel.com/en-us/forums/intel-visual-fortran-compiler-fo.... I am working with a code written in Fortran 77 and I was able to introduce parallelism using OpenMP. However, the
overhead effects are very high, especially when the Intel Fortran Compiler is used. For example, for a Intel Core i7 with 8 Threads I have the following execution times (in seconds):

Serial code: 811.3
ifort OpenMP, 1 thread: 864.4
ifort OpenMP, 8 threads: 169.1

So, you can see that the overhead introduced to the program when OpenMP is enabled is around 6.5% of the serial time, that seems excessive.
After some time and following the advices given in the first post I decided to give a second try to this issue. I profiled my program using Amplifier XE and the report shows a couple of subroutines that share most of the overhead time. For example, consider the following subroutine:

      REAL*8 FUNCTION spin_rejection(qel,medium,elke,beta2,q1,cost, spin
     *_index,is_single)
      IMPLICIT NONE
      REAL*8 elke,beta2,q1,cost
      INTEGER*4 qel,medium
      LOGICAL spin_index,is_single
      COMMON/SPIN_DATA/ spin_rej(1,0:1,0: 31,0:15,0:31), espin_min,espin
     *_max,espml,b2spin_min,b2spin_max, dbeta2,dbeta2i,dlener,dleneri,dq
     *q1,dqq1i, fool_intel_optimizer
      REAL*4 spin_rej,espin_min,espin_max,espml,b2spin_min,b2spin_max, d
     *beta2,dbeta2i,dlener,dleneri,dqq1,dqq1i
      LOGICAL fool_intel_optimizer
      COMMON/RANDOMM/ rng_array(128), urndm(97), crndm, cdrndm, cmrndm,
     *i4opt, ixx, jxx, fool_optimizer, twom24, rng_seed
C$OMP0THREADPRIVATE(/RANDOMM/)
      INTEGER*4 urndm, crndm, cdrndm, cmrndm, i4opt, ixx, jxx, fool_opti
     *mizer,rng_seed,rng_array
      REAL*4 twom24
      REAL*8 rnno,ai,qq1,aj,xi,ak
      INTEGER*4 i,j,k
      SAVE i,j
C$OMP0THREADPRIVATE(i,j)
      IF (( spin_index )) THEN
      spin_index = .false.
      IF (( beta2 .GE. b2spin_min )) THEN
      ai = (beta2 - b2spin_min)*dbeta2i
      i = ai
      ai = ai - i
      i = i + 15 + 1
      ELSE IF(( elke .GT. espml )) THEN
      ai = (elke - espml)*dleneri
      i = ai
      ai = ai - i
      ELSE
      i = 0
      ai = -1
      END IF
      IF((rng_seed .GT. 128))CALL ranmar_get
      rnno = rng_array(rng_seed)*twom24
      rng_seed = rng_seed + 1
      IF((rnno .LT. ai))i = i + 1
      IF (( is_single )) THEN
      j = 0
      ELSE
      qq1 = 2*q1
      qq1 = qq1/(1 + qq1)
      aj = qq1*dqq1i
      j = aj
      IF (( j .GE. 15 )) THEN
      j = 15
      ELSE
      aj = aj - j
      IF((rng_seed .GT. 128))CALL ranmar_get
      rnno = rng_array(rng_seed)*twom24
      rng_seed = rng_seed + 1
      IF((rnno .LT. aj))j = j + 1
      END IF
      END IF
      END IF
      xi = SQRT(0.5*(1-cost))
      ak = xi*31
      k = ak
      ak = ak - k
      spin_rejection = (1-ak)*spin_rej(medium,qel,i,j,k) + ak*spin_rej(m
     *edium,qel,i,j,k+1)
      RETURN
      END

The analysis shows that a lot of overhead occurs when the ranmar_get subroutine is called. It corresponds to:

      SUBROUTINE ranmar_get
      IMPLICIT NONE
      COMMON/RANDOMM/ rng_array(128), urndm(97), crndm, cdrndm, cmrndm,
     *i4opt, ixx, jxx, fool_optimizer, twom24, rng_seed
C$OMP0THREADPRIVATE(/RANDOMM/)
      INTEGER*4 urndm, crndm, cdrndm, cmrndm, i4opt, ixx, jxx, fool_opti
     *mizer,rng_seed,rng_array
      REAL*4 twom24
      INTEGER*4 i,iopt
      IF((rng_seed .EQ. 999999))CALL init_ranmar
      DO 2511 i=1,128
      iopt = urndm(ixx) - urndm(jxx)
      IF((iopt .LT. 0))iopt = iopt + 16777216
      urndm(ixx) = iopt
      ixx = ixx - 1
      jxx = jxx - 1
      IF ((ixx .EQ. 0)) THEN
      ixx = 97
      ELSE IF(( jxx .EQ. 0 )) THEN
      jxx = 97
      END IF
      crndm = crndm - cdrndm
      IF((crndm .LT. 0))crndm = crndm + cmrndm
      iopt = iopt - crndm
      IF((iopt .LT. 0))iopt = iopt + 16777216
      rng_array(i) = iopt
2511  CONTINUE
2512  CONTINUE
      rng_seed = 1
      RETURN
      END

Well, I decided to give a look into the assembly of the spin_rejection function and, for example, at the line when ranmar_get is first called I obtained the following:

0x4202fd Block 196:
0x4202f8 9,019 callq 0x401bc0 <for_write_seq_lis_xmit>
0x4202f0 9,019 lea 0x340(%rsp), %rdx
0x4202e9 9,019 movl %r13d, 0x2d0(%rdi)
0x4202e4 9,019 lea 0x70(%rsp), %rdi
0x4202df 9,019 mov $0x480b14, %esi
0x4202df Block 195:
0x4202da 9,019 callq 0x401bc0 <for_write_seq_lis_xmit>
0x4202d2 9,019 lea 0x320(%rsp), %rdx
0x4202cb 9,019 movl %r15d, 0x2b0(%rdi)
0x4202c6 9,019 lea 0x70(%rsp), %rdi
0x4202c1 9,019 mov $0x480b0c, %esi

[...]

that pattern repeats through 500 lines of the assembly code... At the exit of the spin_rejection function I have several calls to the functions __kmpc_threadprivate_cached and __kmpc_global_thread_num, with a huge overhead time for the latter.

For some reason and remembering that someone points out that maybe is something related to bounds checking of arrays so I decided to reformat the RNG to avoid any use of arrays. Essentially now the generator is called each time a random number is needed and not each a certain number of expended random numbers. So now I have the following version of the RNG:

      SUBROUTINE ranmar_get
      IMPLICIT NONE
      COMMON/RANDOMM/ rng_array, urndm(97), crndm, cdrndm, cmrndm, i4opt
     *, ixx, jxx, fool_optimizer, twom24, rng_seed
C$OMP0THREADPRIVATE(/RANDOMM/)
      INTEGER*4 urndm, crndm, cdrndm, cmrndm, i4opt, ixx, jxx, fool_opti
     *mizer,rng_seed,rng_array
      REAL*4 twom24
      INTEGER*4 iopt
      iopt = urndm(ixx) - urndm(jxx)
      IF((iopt .LT. 0))iopt = iopt + 16777216
      urndm(ixx) = iopt
      ixx = ixx - 1
      jxx = jxx - 1
      IF ((ixx .EQ. 0)) THEN
      ixx = 97
      ELSE IF(( jxx .EQ. 0 )) THEN
      jxx = 97
      END IF
      crndm = crndm - cdrndm
      IF((crndm .LT. 0))crndm = crndm + cmrndm
      iopt = iopt - crndm
      IF((iopt .LT. 0))iopt = iopt + 16777216
      rng_array = iopt
      RETURN
      END

the main change is that now rng_array is a single integer variable and not an array of a certain number of elements. So now the calling sequence in spin_rejection is simple CALL ranmar_get and I obtain the following assembly code for that section:

0x417ceb	8,959	sub %eax, %r8d
0x417ceb		Block 25:
0x417ce4	8,959	addl  0x190(%r15), %eax
0x417ce4		Block 24:
0x417ce2	8,959	jns 0x417ceb <Block 25>
0x417ce0	8,959	sub %ebp, %eax
0x417cd9	8,959	movl  0x188(%r15), %eax
0x417cd2	8,959	movl  0x18c(%r15), %ebp
0x417cd2		Block 23:
0x417ccf	8,959	cmovz %eax, %ecx
0x417ccd	8,959	test %ecx, %ecx
0x417cc8	8,959	mov $0x61, %eax
0x417cc8		Block 22:
0x417cc6	8,959	jmp 0x417cd2 <Block 23>
0x417cc1	8,959	mov $0x61, %edx
0x417cc1		Block 21:
0x417cbf	8,959	jnz 0x417cc8 <Block 22>
0x417cbd	8,959	dec %edx
0x417cb9	8,959	movl  %r8d, (%r15,%rdx,4)
0x417cb5	8,959	cmovs %eax, %r8d
0x417cae	8,959	lea 0x1000000(%r8), %eax
0x417cab	8,959	test %r8d, %r8d
0x417ca9	8,959	dec %ecx
0x417ca5	8,959	subl  (%r15,%rcx,4), %r8d
0x417ca1	8,959	movl  (%r15,%rdx,4), %r8d
0x417c9e	8,959	movsxd %ecx, %rcx
0x417c97	8,959	movl  0x19c(%r15), %ecx
0x417c94	8,959	movsxd %edx, %rdx
0x417c8d	8,959	movl  0x198(%r15), %edx

that is the entire assembly code for the first call of ranmar_get. Well, it is much simpler that the original version. The overhead cost of the call to ranmar_get inside spin_rejection also decreases heavily. __kmpc_threadprivate_cached and __kmpc_global_thread_num are also called at the end of the spin_rejection subroutine, but the overhead cost is much lower. I have starting believing that the use of arrays in the RNG is the source, in some way, of the overhead when OpenMP is enabled within the ifort compiler. Well, I have even tested other RNG with much simpler structures and I do not have any overhead cost pointed out by Amplifier XE. Even __kmpc_threadprivate_cached and __kmpc_global_thread_num are not called at all at the end of the spin_rejection subroutine.

I would like to know if something related with arrays could be the source of the overhead costs in my program, and if there is a way to alleviate this problem. I am not a programming expert so it is quite difficult to diagnose this problem and it limits the scalability of my program.

Thanks for your help!

Huge overhead cost with IFORT OpenMP - Part 2

Trending Articles

Bath man appears in court charged with attempted murder of a man...

MACLEAN, Allan

Black Angus Grilled Artichokes

Practice Sheet of Right form of verbs for HSC Students

Police blotter for Jan. 12

99 God Status for Whatsapp, Facebook

Rajasthan Board 12th Science Result 2018 name wise- RBSE 12th commerce result...

Notorious Naushad of Ippa gang nabbed

Child Kidnapping: Amy McNeil was kidnapped on her way to school by 5 adults;...

Sonible Smartlimit v1.1.5-R2R

NCERT Solutions for Class 9th Sanskrit Chapter 3 पाथेयम्

मतलबी दोस्त स्टेट्स | Matlabi Dost Status in Hindi – Selfish Friends Status

Arrow Flash 2 – Sinhala Dubbed – Episode 23 – 20th March 2016

[GET] AI Traffic Goldmine

[E² Plugin] HDF-Radio

Universal Multi-Patch v1.3 By RADIXX11

IWAN – Thanks and Praise ( Throw Back Thursday )

RONALD P SONDERGAARD Arrested by Miami-Dade County Corrections on Mar 03, 2017

मुख मैथुन से उठाएं सेक्स का भरपूर मज़ा, जानें क्या है इसका सही तरीकामुख मैथुन...

HSSC Excise & Taxation Inspector Result 2017 Scorecard/ Category Wise Merit List