MKL and OpenMP perform slower than simple DO loops

I would like to continue the discussion of my performance problem with array operations in a new topic. Some history can be found in https://software.intel.com/en-us/forums/intel-visual-fortran-compiler-for-windows/topic/605372#comment-1854000 and http://openmp.org/forum/viewtopic.php?f=3&t=1682 . Some of the participants in these previous discussions got better performance on their hardware and compilers than I did. I was also advised to try MKL, and that is part of this topic. The disappointing result: neither OpenMP nor MKL is faster than simple DO loops on my laptop.

My questions are: is my hardware not suited for parallel calculations, did I forget to use some special compiler options, or does Hyper-Threading (HT) on my Win7 x64 Home Premium SP1 impede the performance (I don't know how to suppress it)? I am attaching my processor details (for the bandwidth issue etc.).
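On the HT question, a minimal sketch like the following could show what the OpenMP runtime sees before any timing is done. It only uses the standard omp_lib routines omp_get_num_procs and omp_get_max_threads; the program name show_omp_env is made up for illustration. With HT enabled, the processor count reported is typically the number of logical processors, not physical cores.

```fortran
! Sketch: report what the OpenMP runtime sees on this machine.
! Build the same way as the test program, e.g. ifort show_omp_env.f90 /Qopenmp
program show_omp_env
use omp_lib
implicit none
integer :: nth

print *, 'processors seen by OpenMP =', omp_get_num_procs()
print *, 'default max threads       =', omp_get_max_threads()

! Count the threads that really start in a parallel region
nth = 0
!$omp parallel reduction(+:nth)
nth = nth + 1
!$omp end parallel
print *, 'threads actually started  =', nth
end program show_omp_env
```

If the reported processor count is twice the number of physical cores, two threads may or may not end up on separate cores; that depends on the runtime and the OS scheduler and is not something this sketch can decide.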

This is the test code, which compares DO loops, vector (array) notation, OpenMP and MKL.

! TESTS 26.12.2015
! Test speed for array operation y(i)=alpha*x(i)+y(i) in 4 different ways
!
! "C:\Program Files (x86)\Intel\Composer XE 2013 SP1\mkl\bin\mklvars" intel64 mod
! "C:\Program Files (x86)\Intel\Composer XE 2013 SP1\bin\compilervars.bat" intel64
! ifort testMKLvsOpenMP.f90 /QopenMP /Qmkl
!    Intel(R) Visual Fortran Intel(R) 64 Compiler XE for applications running on Intel(R) 64, Version 14.0.2.176 Build 20140130
!    Microsoft (R) Incremental Linker Version 9.00.21022.08
!     -out:testMKLvsOpenMP.exe
!     -subsystem:console
!     -defaultlib:libiomp5md.lib
!     -nodefaultlib:vcomp.lib
!     -nodefaultlib:vcompd.lib
!     "-libpath:C:\Program Files (x86)\Intel\Composer XE 2013 SP1\mkl\lib\intel64"
!     testMKLvsOpenMP.obj
!
program TestMKLvsOpenMP
use omp_lib
IMPLICIT NONE
integer :: N
real*8,Allocatable :: x(:),y(:)
real*8 :: alpha
real*8 :: endtime,starttime,DSECND
real :: cpu1,cpu2
integer :: NTHREADS,irepeat,nrepeat,i
! initialize
alpha=1.0d-4                  ! double-precision literal to match real*8 alpha
print *,'N=?'
read *,N
nrepeat=1000000000/N          ! nrepeat*N = 10**9 (1000 Mio) element updates per test
print *,'nrepeat=',nrepeat

Allocate (x(N),y(N))
x(:)=0.  ; y(:)=0.
pause 'Press Return'

! 1. standard do loops
forall (i=1:N) ; x(i)=i ; y(i)=-i ;end forall
Nthreads=0
starttime = OMP_get_wtime()
Call cpu_time(cpu1)
do irepeat=1,nrepeat
 do i=1,N
   y(i)=alpha*x(i)+y(i)
 enddo
enddo
endtime = OMP_get_wtime() ; Call cpu_time(cpu2)
print *, 'Threads=',NTHREADS,' DO time=',SNGL(endtime - starttime),cpu2-cpu1
pause 'Press Return'

! 2. vector
forall (i=1:N) ; x(i)=i ; y(i)=-i ; end forall
Nthreads=0
Call cpu_time(cpu1)
do irepeat=1,nrepeat
   y(1:N)=alpha*x(1:N)+y(1:N)
enddo
Call cpu_time(cpu2)
print *, 'Threads=',NTHREADS,' Vector time=',cpu2-cpu1
pause 'Press Return'

Nthreads=2

! 3. OMP
forall (i=1:N) ;x(i)=i ; y(i)=-i ; end forall

CALL OMP_SET_NUM_THREADS(NTHREADS)
starttime = OMP_get_wtime() ; Call cpu_time(cpu1)

!$OMP PARALLEL Shared(N,x,y,alpha)
do irepeat=1,nrepeat
!$OMP DO  PRIVATE(i) SCHEDULE(static)
 do i=1,N
   y(i)=alpha*x(i)+y(i)
 enddo
!$OMP END DO nowait
enddo
!$OMP END PARALLEL

endtime = OMP_get_wtime() ; Call cpu_time(cpu2)
print *, 'OMP Threads=',NTHREADS,' OMPtime=',SNGL(endtime - starttime),(cpu2-cpu1)/Nthreads
pause 'Press Return'

! 4. MKL
forall (i=1:N) ; x(i)=i ; y(i)=-i ; end forall
starttime =DSECND() ; Call cpu_time(cpu1)
CALL MKL_SET_NUM_THREADS(NTHREADS)
do irepeat=1,nrepeat
  CALL daxpy(N,alpha,x,1, y ,1)
end do
endtime = DSECND(); Call cpu_time(cpu2)
print *, 'MKL Threads=',NTHREADS,' time=',SNGL(endtime - starttime),(cpu2-cpu1)/Nthreads

end

The results for N=1000000 (1 million) and N=10000 are:

 N=?
1000000
 nrepeat=        1000
Press Return
 Threads=           0  DO time=  0.1767103      0.1716011
Press Return
 Threads=           0  Vector time=  0.1716011
Press Return
 OMP Threads=           2  OMPtime=   1.397965       1.388409
Press Return
 MKL Threads=           2  time=   1.406852       1.404009

 N=?
10000
 nrepeat=      100000
Press Return
 Threads=           0  DO time=  0.1744737      0.1560010
Press Return
 Threads=           0  Vector time=  0.1716011
Press Return
 OMP Threads=           2  OMPtime=  0.2589355      0.2574016
Press Return
 MKL Threads=           2  time=  0.3295782      0.3120020

When I go down to N=100, OMP and MKL run slower than at N=10000:

 OMP Threads=           2  OMPtime=  0.8096205      0.8112052
 MKL Threads=           2  time=  0.4926147      0.2418015
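As a rough sanity check on the bandwidth question, the wall times of the N=1000000 run can be turned into an implied memory traffic. This is only a sketch: it assumes 24 bytes per update (read x(i), read y(i), write y(i)) and that nothing stays in cache between repeats, and neither assumption was measured. The program name implied_bandwidth is made up.

```fortran
! Sketch: implied memory traffic for the N=1000000 run above.
! Assumption: 24 bytes per update (read x(i), read y(i), write y(i)).
program implied_bandwidth
implicit none
double precision, parameter :: bytes_per_update = 24.0d0
double precision, parameter :: updates = 1.0d9      ! N*nrepeat
double precision :: t(3)
character(len=3) :: label(3)
integer :: k

label = (/ 'DO ', 'OMP', 'MKL' /)
t = (/ 0.1767103d0, 1.397965d0, 1.406852d0 /)       ! wall times printed above

do k = 1, 3
   print '(a,a,f8.1,a)', label(k), ' implied traffic =', &
        bytes_per_update*updates/t(k)/1.0d9, ' GB/s'
enddo
end program implied_bandwidth
```

With these inputs the DO case works out to roughly 136 GB/s and the OMP and MKL cases to roughly 17 GB/s. If the first figure is well above what the laptop's memory can deliver, that may suggest the serial loop nest is not streaming both arrays from memory on every repeat (for example because the compiler restructured the two nested loops), so the DO and threaded timings might not be measuring the same work. That is a guess from the arithmetic, not a confirmed diagnosis.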

All comments are welcome.

Attachments:
WMIC_Bessel.txt (1.95 KB)
CPU-Z HTML report file.pdf (67.12 KB)
testMKLvsOpenMP.f90 (2.59 KB)
