Channel: Intel® Fortran Compiler

Data Prefetching using Fortran Directives


Hi everyone,

I am optimizing sparse algorithms with Intel's Fortran compiler. Having applied the other optimization features, I now want to make good use of data prefetching and the cache. To that end I tested several plausible combinations of prefetch directives and intrinsic functions on both an Intel Core i7 and an AMD APU processor, but I did not get the results I expected. In one specific case, however, I think I am getting real prefetching, which gives me a 3-4 times speedup.

The following is the faster code:

    DOUBLE PRECISION, DIMENSION(:), ALLOCATABLE :: A2D, X, TEMP
    DOUBLE PRECISION :: SUM
    INTEGER :: SIZE, I, J, COUNT, BLS, I0

    SIZE = 1000000
    BLS = 21 * 25

    ALLOCATE(A2D(0:BLS * SIZE - 1))
    ALLOCATE(X(0:SIZE - 1))
    ALLOCATE(TEMP(0:BLS - 1))

    !DEC$ SIMD
    DO J = 0, SIZE - 1
        DO I = 0, BLS - 1
            A2D(BLS * J + I) = I + J
        END DO
    END DO

    DO COUNT = 0, 50
        !$OMP PARALLEL SHARED(A2D, X, SIZE, BLS)
            !$OMP DO SCHEDULE(STATIC) PRIVATE(J, I, SUM, TEMP, I0)
        !DEC$ SIMD
        DO J = 0, SIZE - 1
            I0 = BLS * J
            DO I = 0, BLS - 1
                TEMP(I) = A2D(I0 + I)
            END DO
            SUM = 0.D0
            DO I = 0, BLS - 1
                SUM = SUM + TEMP(I) * 2.D0
            END DO
            X(J) = SUM
        END DO
            !$OMP END DO
        !$OMP END PARALLEL
    END DO

And the following is the code I expect to be correct, but it is around 4 times slower (I think because the prefetch directive does not work):

    DOUBLE PRECISION, DIMENSION(:), ALLOCATABLE :: A2D, x
    DOUBLE PRECISION :: SUM
    INTEGER :: SIZE, I, J, COUNT, BLS, I0

    SIZE = 1000000
    BLS = 21 * 25

    ALLOCATE(A2D(0:BLS * SIZE - 1))
    ALLOCATE(X(0:SIZE - 1))

    !DEC$ SIMD
    DO J = 0, SIZE - 1
        DO I = 0, BLS - 1
            A2D(BLS * J + I) = I + J
        END DO
    END DO

    DO COUNT = 0, 50
        !$OMP PARALLEL SHARED(A2D, X, SIZE, BLS)
            !$OMP DO SCHEDULE(STATIC) PRIVATE(J, I, SUM, TEMP, I0, J_CACHE)
        !DEC$ PREFETCH A2D
        DO J = 0, SIZE - 1
            I0 = BLS * J
            SUM = 0.D0
            !DEC$ SIMD
            DO I = 0, BLS - 1
                SUM = SUM + A2D(I0 + I) * 2.D0
            END DO
            X(J) = SUM
        END DO
            !$OMP END DO
        !$OMP END PARALLEL
    END DO
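
For comparison, an intrinsic-based variant of the loop above could be sketched as follows. It assumes Intel Fortran's MM_PREFETCH intrinsic subroutine (a compiler extension, so it will not build with other compilers). The prefetch distance PFD, the hint value 1, and the reduced SIZE are illustrative assumptions, not measured or tuned settings. It also issues one request per element for simplicity, although one per cache line would be enough.

```fortran
PROGRAM PREFETCH_SKETCH
    ! Sketch only: MM_PREFETCH is an Intel Fortran extension.
    IMPLICIT NONE
    DOUBLE PRECISION, DIMENSION(:), ALLOCATABLE :: A2D, X
    DOUBLE PRECISION :: SUM
    INTEGER :: SIZE, I, J, BLS, I0
    ! Assumed prefetch distance in array elements (hypothetical value).
    INTEGER, PARAMETER :: PFD = 64

    SIZE = 1000          ! far smaller than in the post, to keep the sketch light
    BLS = 21 * 25

    ALLOCATE(A2D(0:BLS * SIZE - 1))
    ALLOCATE(X(0:SIZE - 1))

    DO J = 0, SIZE - 1
        DO I = 0, BLS - 1
            A2D(BLS * J + I) = I + J
        END DO
    END DO

    !$OMP PARALLEL DO SCHEDULE(STATIC) PRIVATE(J, I, SUM, I0) SHARED(A2D, X, SIZE, BLS)
    DO J = 0, SIZE - 1
        I0 = BLS * J
        SUM = 0.D0
        DO I = 0, BLS - 1
            ! Request the element PFD positions ahead; the index is clamped
            ! so the array reference itself always stays in bounds.
            CALL MM_PREFETCH(A2D(MIN(I0 + I + PFD, BLS * SIZE - 1)), 1)
            SUM = SUM + A2D(I0 + I) * 2.D0
        END DO
        X(J) = SUM
    END DO
    !$OMP END PARALLEL DO

    PRINT *, X(0), X(SIZE - 1)
END PROGRAM PREFETCH_SKETCH
```

Whether this helps at all depends on the hardware prefetcher, which typically already handles a unit-stride stream like this one, so it is meant only as a point of comparison for the directive version.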

I am really confused and need your help.

