I would like to continue the discussion of my performance problem with array operations in a new topic. Some history can be found at https://software.intel.com/en-us/forums/intel-visual-fortran-compiler-for-windows/topic/605372#comment-1854000 and http://openmp.org/forum/viewtopic.php?f=3&t=1682 . Some participants in those earlier discussions got better performance on their hardware and compilers than I did. I was also advised to try MKL, which is part of this topic. The disappointing result: neither OMP nor MKL is faster than simple DO loops on my laptop.
My questions are: is my hardware unsuited to parallel calculations, did I forget some special compiler options, or does HT (Hyper-Threading) on my Win7 x64 Home Premium SP1 impede performance (I don't know how to suppress it)? I am attaching my processor details (bandwidth issue etc.).
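For anyone who wants to compare environments, here is a minimal sketch that prints what the OpenMP runtime itself reports. It assumes only the standard omp_lib routines and a build with OpenMP enabled; the program and variable names are made up for illustration. Note that omp_get_num_procs counts logical processors, so by itself it does not show whether HT is active.

```fortran
program show_omp_env
  use omp_lib
  implicit none
  integer :: nthreads_in_region   ! hypothetical name, for illustration only

  ! Logical processors visible to the OpenMP runtime (HT siblings are counted)
  print *, 'omp_get_num_procs   =', omp_get_num_procs()
  ! Upper bound on the team size of the next parallel region
  print *, 'omp_get_max_threads =', omp_get_max_threads()

  nthreads_in_region = 0
  !$omp parallel
  !$omp single
  ! Team size actually granted inside a parallel region
  nthreads_in_region = omp_get_num_threads()
  !$omp end single
  !$omp end parallel
  print *, 'threads in region   =', nthreads_in_region
end program show_omp_env
```

If the first number is twice the physical core count from the processor data sheet, HT is probably on; limiting the team with OMP_SET_NUM_THREADS, as the test code does, is one way to experiment with that.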
This is the test code, trying to compare Do-loops, vector notation, OMP and MKL.
! TESTS 26.12.2015
! Test speed for array operation y(i)=a*x(i)+y(i) in 4 different ways
!
! "C:\Program Files (x86)\Intel\Composer XE 2013 SP1\mkl\bin\mklvars" intel64 mod
! "C:\Program Files (x86)\Intel\Composer XE 2013 SP1\bin\compilervars.bat" intel64
! ifort testMKLvsOpenMP.f90 /QopenMP /Qmkl
! Intel(R) Visual Fortran Intel(R) 64 Compiler XE for applications running on Intel(R) 64, Version 14.0.2.176 Build 20140130
! Microsoft (R) Incremental Linker Version 9.00.21022.08
! -out:testMKLvsOpenMP.exe
! -subsystem:console
! -defaultlib:libiomp5md.lib
! -nodefaultlib:vcomp.lib
! -nodefaultlib:vcompd.lib
! "-libpath:C:\Program Files (x86)\Intel\Composer XE 2013 SP1\mkl\lib\intel64"
! testMKLvsOpenMP.obj
!
program TestMKLvsOpenMP
  use omp_lib
  IMPLICIT NONE
  integer :: N
  real*8,Allocatable :: x(:),y(:)
  real*8 :: alpha
  real*8 :: endtime,starttime,DSECND
  real :: cpu1,cpu2
  integer :: NTHREADS,irepeat,nrepeat,i

  ! initialize
  alpha=.0001
  print *,'N=?'
  read *,N
  nrepeat=1000000000/N   ! nrepeat*N = 10**9 (1000 Mio) element updates per test
  print *,'nrepeat=',nrepeat
  Allocate (x(N),y(N))
  x(:)=0. ; y(:)=0.
  pause 'Press Return'

  ! 1. standard do loops
  forall (i=1:N) ; x(i)=i ; y(i)=-i ; end forall
  Nthreads=0
  starttime = OMP_get_wtime()
  Call cpu_time(cpu1)
  do irepeat=1,nrepeat
    do i=1,N
      y(i)=alpha*x(i)+y(i)
    enddo
  enddo
  endtime = OMP_get_wtime() ; Call cpu_time(cpu2)
  print *, 'Threads=',NTHREADS,' DO time=',SNGL(endtime - starttime),cpu2-cpu1
  pause 'Press Return'

  ! 2. vector (array notation); only CPU time is measured here
  forall (i=1:N) ; x(i)=i ; y(i)=-i ; end forall
  Nthreads=0
  Call cpu_time(cpu1)
  do irepeat=1,nrepeat
    y(1:N)=alpha*x(1:N)+y(1:N)
  enddo
  Call cpu_time(cpu2)
  print *, 'Threads=',NTHREADS,' Vector time=',cpu2-cpu1
  pause 'Press Return'

  Nthreads=2
  ! 3. OMP: one parallel region, work-shared inner loop
  forall (i=1:N) ; x(i)=i ; y(i)=-i ; end forall
  CALL OMP_SET_NUM_THREADS(NTHREADS)
  starttime = OMP_get_wtime() ; Call cpu_time(cpu1)
  !$OMP PARALLEL Shared(N,x,y,alpha)
  do irepeat=1,nrepeat
    !$OMP DO PRIVATE(i) SCHEDULE(static)
    do i=1,N
      y(i)=alpha*x(i)+y(i)
    enddo
    !$OMP END DO nowait
  enddo
  !$OMP END PARALLEL
  endtime = OMP_get_wtime() ; Call cpu_time(cpu2)
  ! cpu_time is summed over threads, hence the division by Nthreads
  print *, 'OMP Threads=',NTHREADS,' OMPtime=',SNGL(endtime - starttime),(cpu2-cpu1)/Nthreads
  pause 'Press Return'

  ! 4. MKL
  forall (i=1:N) ; x(i)=i ; y(i)=-i ; end forall
  starttime =DSECND() ; Call cpu_time(cpu1)
  CALL MKL_SET_NUM_THREADS(NTHREADS)
  do irepeat=1,nrepeat
    CALL daxpy(N,alpha,x,1, y ,1)
  end do
  endtime = DSECND(); Call cpu_time(cpu2)
  print *, 'MKL Threads=',NTHREADS,' time=',SNGL(endtime - starttime),(cpu2-cpu1)/Nthreads
end
The results for N=1000000 (1 million) and N=10000 are:

 N=?
1000000
 nrepeat=        1000
Press Return
 Threads=           0  DO time=  0.1767103      0.1716011
Press Return
 Threads=           0  Vector time=  0.1716011
Press Return
 OMP Threads=           2  OMPtime=   1.397965       1.388409
Press Return
 MKL Threads=           2  time=   1.406852       1.404009

 N=?
10000
 nrepeat=      100000
Press Return
 Threads=           0  DO time=  0.1744737      0.1560010
Press Return
 Threads=           0  Vector time=  0.1716011
Press Return
 OMP Threads=           2  OMPtime=  0.2589355      0.2574016
Press Return
 MKL Threads=           2  time=  0.3295782      0.3120020
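Since bandwidth may be the issue, here is a rough sketch that converts the N=1000000 wall-clock times into the memory traffic they would imply. It assumes 24 bytes per element update (read x, read y, write y) and that every one of the 10**9 updates really goes to main memory; both are assumptions, not measurements, and the program and function names are made up for illustration.

```fortran
program implied_bandwidth
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  ! Assumption: 3 x 8 bytes of traffic per element (read x, read y, write y)
  real(dp), parameter :: bytes_per_update = 24.0_dp
  ! N * nrepeat = 10**9 element updates in each test
  real(dp), parameter :: updates = 1.0e9_dp
  ! Wall-clock seconds for N=1000000, copied from the output above
  real(dp), parameter :: t_do  = 0.1767103_dp
  real(dp), parameter :: t_omp = 1.397965_dp
  real(dp), parameter :: t_mkl = 1.406852_dp

  print '(a,f8.1,a)', 'DO  : ', gbps(t_do),  ' GB/s implied'
  print '(a,f8.1,a)', 'OMP : ', gbps(t_omp), ' GB/s implied'
  print '(a,f8.1,a)', 'MKL : ', gbps(t_mkl), ' GB/s implied'

contains

  ! Implied traffic in GB/s (10**9 bytes per second) for a given run time
  pure function gbps(seconds) result(rate)
    real(dp), intent(in) :: seconds
    real(dp) :: rate
    rate = bytes_per_update * updates / seconds / 1.0e9_dp
  end function gbps

end program implied_bandwidth
```

Under these assumptions the DO figure comes out near 136 GB/s and the OMP and MKL figures near 17 GB/s. The first looks too high for laptop memory, so the compiler may have restructured the repeated DO loop; that is only a guess and would need checking, for example with an optimization report.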
When I go down to N=100, both OMP and MKL run slower:

 OMP Threads=           2  OMPtime=  0.8096205      0.8112052
 MKL Threads=           2  time=  0.4926147      0.2418015
All comments are welcome.