Channel: Intel® Fortran Compiler

MKL and OpenMP perform slower than simple DO loops


I would like to continue the discussion of my performance problem with array operations in a new topic. Some history can be found at https://software.intel.com/en-us/forums/intel-visual-fortran-compiler-for-windows/topic/605372#comment-1854000 and http://openmp.org/forum/viewtopic.php?f=3&t=1682 . Some of the participants in those previous discussions got better performance on their hardware and compilers than I did. I was also advised to try MKL, which is part of this topic. The disappointing result: neither OpenMP nor MKL is faster than simple DO loops on my laptop.

My questions are: is my hardware not suited to parallel calculations, did I forget some special compiler options, or does Hyper-Threading (HT) on my Windows 7 x64 Home Premium SP1 system impede performance (I don't know how to disable it)? I am attaching my processor details (bandwidth issue etc.).
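For reference, here is a minimal sketch of how to check what the OpenMP runtime sees on a given machine. It uses only the standard omp_lib routines omp_get_num_procs, omp_get_max_threads and omp_get_num_threads. The program name CheckThreads and the file name checkthreads.f90 are made up for illustration. As far as I understand, omp_get_num_procs normally counts logical processors, so with HT enabled it would usually report twice the number of physical cores.

```fortran
! Minimal sketch: report what the OpenMP runtime sees on this machine.
! Assumed build line (hypothetical file name): ifort checkthreads.f90 /Qopenmp
program CheckThreads
use omp_lib
implicit none
integer :: nprocs, maxthreads, nteam

nprocs     = omp_get_num_procs()     ! processors visible to the runtime
maxthreads = omp_get_max_threads()   ! upper bound on the next team size

nteam = 1
!$OMP PARALLEL
!$OMP MASTER
nteam = omp_get_num_threads()        ! team size actually used in this region
!$OMP END MASTER
!$OMP END PARALLEL

print *, 'processors seen by OpenMP =', nprocs
print *, 'max threads               =', maxthreads
print *, 'threads in parallel region=', nteam
end program CheckThreads
```

Running it once with and once without the OMP_NUM_THREADS environment variable set shows whether the runtime is picking up the intended thread count.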

This is the test code, which compares DO loops, array (vector) notation, OpenMP and MKL.

! TESTS 26.12.2015
! Test speed for array operation y(i)=alpha*x(i)+y(i) in 4 different ways
!
! "C:\Program Files (x86)\Intel\Composer XE 2013 SP1\mkl\bin\mklvars" intel64 mod
! "C:\Program Files (x86)\Intel\Composer XE 2013 SP1\bin\compilervars.bat" intel64
! ifort testMKLvsOpenMP.f90 /QopenMP /Qmkl
!    Intel(R) Visual Fortran Intel(R) 64 Compiler XE for applications running on Intel(R) 64, Version 14.0.2.176 Build 20140130
!    Microsoft (R) Incremental Linker Version 9.00.21022.08
!     -out:testMKLvsOpenMP.exe
!     -subsystem:console
!     -defaultlib:libiomp5md.lib
!     -nodefaultlib:vcomp.lib
!     -nodefaultlib:vcompd.lib
!     "-libpath:C:\Program Files (x86)\Intel\Composer XE 2013 SP1\mkl\lib\intel64"
!     testMKLvsOpenMP.obj
!
program TestMKLvsOpenMP
use omp_lib
IMPLICIT NONE
integer :: N
real*8,Allocatable :: x(:),y(:)
real*8 :: alpha
real*8 :: endtime,starttime,DSECND
real :: cpu1,cpu2
integer :: NTHREADS,irepeat,nrepeat,i
! initialize
alpha=.0001
print *,'N=?'
read *,N
nrepeat=1000000000/N          ! nrepeat*N = 1e9 (1000 Mio)
print *,'nrepeat=',nrepeat

Allocate (x(N),y(N))
x(:)=0.  ; y(:)=0.
pause 'Press Return'

! 1. standard do loops
forall (i=1:N) ; x(i)=i ; y(i)=-i ;end forall
Nthreads=0
starttime = OMP_get_wtime()
Call cpu_time(cpu1)
do irepeat=1,nrepeat
 do i=1,N
   y(i)=alpha*x(i)+y(i)
 enddo
enddo
endtime = OMP_get_wtime() ; Call cpu_time(cpu2)
print *, 'Threads=',NTHREADS,' DO time=',SNGL(endtime - starttime),cpu2-cpu1
pause 'Press Return'

! 2. vector
forall (i=1:N) ; x(i)=i ; y(i)=-i ; end forall
Nthreads=0
Call cpu_time(cpu1)
do irepeat=1,nrepeat
   y(1:N)=alpha*x(1:N)+y(1:N)
enddo
Call cpu_time(cpu2)
print *, 'Threads=',NTHREADS,' Vector time=',cpu2-cpu1
pause 'Press Return'

Nthreads=2

! 3. OMP
forall (i=1:N) ;x(i)=i ; y(i)=-i ; end forall

CALL OMP_SET_NUM_THREADS(NTHREADS)
starttime = OMP_get_wtime() ; Call cpu_time(cpu1)

!$OMP PARALLEL Shared(N,x,y,alpha)
do irepeat=1,nrepeat
!$OMP DO  PRIVATE(i) SCHEDULE(static)
 do i=1,N
   y(i)=alpha*x(i)+y(i)
 enddo
!$OMP END DO nowait
enddo
!$OMP END PARALLEL

endtime = OMP_get_wtime() ; Call cpu_time(cpu2)
print *, 'OMP Threads=',NTHREADS,' OMPtime=',SNGL(endtime - starttime),(cpu2-cpu1)/Nthreads
pause 'Press Return'

! 4. MKL
forall (i=1:N) ; x(i)=i ; y(i)=-i ; end forall
starttime =DSECND() ; Call cpu_time(cpu1)
CALL MKL_SET_NUM_THREADS(NTHREADS)
do irepeat=1,nrepeat
  CALL daxpy(N,alpha,x,1, y ,1)
end do
endtime = DSECND(); Call cpu_time(cpu2)
print *, 'MKL Threads=',NTHREADS,' time=',SNGL(endtime - starttime),(cpu2-cpu1)/Nthreads

end

The results for N=1000000 (1 mio) and N=10000 are:

 N=?
1000000
 nrepeat=        1000
Press Return
 Threads=           0  DO time=  0.1767103      0.1716011
Press Return
 Threads=           0  Vector time=  0.1716011
Press Return
 OMP Threads=           2  OMPtime=   1.397965       1.388409
Press Return
 MKL Threads=           2  time=   1.406852       1.404009
 N=?
10000
 nrepeat=      100000
Press Return
 Threads=           0  DO time=  0.1744737      0.1560010
Press Return
 Threads=           0  Vector time=  0.1716011
Press Return
 OMP Threads=           2  OMPtime=  0.2589355      0.2574016
Press Return
 MKL Threads=           2  time=  0.3295782      0.3120020

When I go down to N=100, OMP and MKL run slower than with N=10000:

 OMP Threads=           2  OMPtime=  0.8096205      0.8112052
 MKL Threads=           2  time=  0.4926147      0.2418015
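
As a rough cross-check of the bandwidth question, the wall-clock times reported for N=1000000 can be converted into an effective memory traffic figure. This is a back-of-the-envelope sketch with two unverified assumptions: each update y(i)=alpha*x(i)+y(i) moves 24 bytes (read x(i), read y(i), write y(i), 8 bytes each), and nothing is reused from cache between repeats.

```latex
\begin{align*}
N \cdot n_{\text{repeat}} &= 10^{6} \times 10^{3} = 10^{9} \ \text{updates} \\
\text{traffic} &= 24\ \text{B} \times 10^{9} = 2.4 \times 10^{10}\ \text{B} \\
B_{\text{DO}}  &\approx \frac{2.4 \times 10^{10}\ \text{B}}{0.1767\ \text{s}} \approx 135.8\ \text{GB/s} \\
B_{\text{OMP}} &\approx \frac{2.4 \times 10^{10}\ \text{B}}{1.398\ \text{s}} \approx 17.2\ \text{GB/s} \\
B_{\text{MKL}} &\approx \frac{2.4 \times 10^{10}\ \text{B}}{1.407\ \text{s}} \approx 17.1\ \text{GB/s}
\end{align*}
```

These are derived figures, not measurements. The serial figure in particular depends entirely on the assumption that every repeat streams both arrays from memory.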

 All comments are welcome.

