Hi everyone,
I am working on optimizing sparse algorithms with Intel's Fortran compiler. After applying various optimization features, I now want to make good use of data prefetching and cache utilization. To that end, I tested several plausible configurations of prefetch directives and intrinsic functions on both an Intel Core i7 and an AMD APU processor, but I did not get the results I expected. In one specific case, however, I believe prefetching really does kick in and gives me a 3-4x speedup.
Here is the faster code:
DOUBLE PRECISION, DIMENSION(:), ALLOCATABLE :: A2D, X, TEMP
DOUBLE PRECISION :: SUM
INTEGER :: SIZE, I, J, COUNT, BLS, I0

SIZE = 1000000
BLS = 21 * 25
ALLOCATE(A2D(0:BLS * SIZE - 1))
ALLOCATE(X(0:SIZE - 1))
ALLOCATE(TEMP(0:BLS - 1))

!DEC$ SIMD
DO J = 0, SIZE - 1
   DO I = 0, BLS - 1
      A2D(BLS * J + I) = I + J
   END DO
END DO

DO COUNT = 0, 50
   !$OMP PARALLEL SHARED(A2D, X, SIZE, BLS)
   !$OMP DO SCHEDULE(STATIC) PRIVATE(J, I, SUM, TEMP, I0)
   !DEC$ SIMD
   DO J = 0, SIZE - 1
      I0 = BLS * J
      DO I = 0, BLS - 1
         TEMP(I) = A2D(I0 + I)
      END DO
      SUM = 0.D0
      DO I = 0, BLS - 1
         SUM = SUM + TEMP(I) * 2.D0
      END DO
      X(J) = SUM
   END DO
   !$OMP END DO
   !$OMP END PARALLEL
END DO
The following is the code I would expect to perform at least as well, but it runs about 4 times slower (I suspect the prefetch directive has no effect):
DOUBLE PRECISION, DIMENSION(:), ALLOCATABLE :: A2D, X
DOUBLE PRECISION :: SUM
INTEGER :: SIZE, I, J, COUNT, BLS, I0

SIZE = 1000000
BLS = 21 * 25
ALLOCATE(A2D(0:BLS * SIZE - 1))
ALLOCATE(X(0:SIZE - 1))

!DEC$ SIMD
DO J = 0, SIZE - 1
   DO I = 0, BLS - 1
      A2D(BLS * J + I) = I + J
   END DO
END DO

DO COUNT = 0, 50
   !$OMP PARALLEL SHARED(A2D, X, SIZE, BLS)
   ! Note: TEMP and J_CACHE are not declared in this version, so they
   ! must not appear in the PRIVATE clause.
   !$OMP DO SCHEDULE(STATIC) PRIVATE(J, I, SUM, I0)
   !DEC$ PREFETCH A2D
   DO J = 0, SIZE - 1
      I0 = BLS * J
      SUM = 0.D0
      !DEC$ SIMD
      DO I = 0, BLS - 1
         SUM = SUM + A2D(I0 + I) * 2.D0
      END DO
      X(J) = SUM
   END DO
   !$OMP END DO
   !$OMP END PARALLEL
END DO
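For completeness, newer Intel compilers also accept an optional cache-level hint and prefetch distance per variable on the directive (var:hint:distance form). Below is a minimal sketch of the inner loop with explicit values; the hint of 1 and the distance of 16 iterations are placeholder guesses on my part, not tuned numbers, and as far as I understand the directive is only honored when software prefetching is enabled at compile time (e.g. via -qopt-prefetch on recent ifort versions):

```fortran
! Sketch only: PREFETCH with explicit hint and distance.
! A2D:1:16 asks to prefetch A2D roughly 16 iterations ahead;
! both values are untuned assumptions.
DO J = 0, SIZE - 1
   I0 = BLS * J
   SUM = 0.D0
   !DIR$ PREFETCH A2D:1:16
   !DEC$ SIMD
   DO I = 0, BLS - 1
      SUM = SUM + A2D(I0 + I) * 2.D0
   END DO
   X(J) = SUM
END DO
```

If the plain !DEC$ PREFETCH A2D form is being silently ignored without the corresponding compile option, that alone could explain seeing no effect from the directive.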
I am really confused and need your help.