Channel: Intel® Fortran Compiler

Puzzle: changing the order of outer loops leads to significant performance increase


Hi,

I have found, to my surprise, that changing the order of the two outermost loops leads to a significant performance increase. I am experimenting with the following two versions of a small piece of code:

Version 1: ii, k, j, i

        do ii = iis, iie
          value = vals(ii)
          do k = ks, ke
            do j = js, je
              ind_offset = ( (k-1)*N2 + (j-1) ) * N1g
!DIR$ SIMD
              do i = is, ie
                l0 = ind_offset + i
                dF(l0) = dF(l0) + value * F(l0 + ii)
              end do
            end do
          end do
        end do

Version 2: k, ii, j, i

        do k = ks, ke
          do ii = iis, iie
            value = vals(ii)
            do j = js, je
              ind_offset = ( (k-1)*N2 + (j-1) ) * N1g
!DIR$ SIMD
              do i = is, ie
                l0 = ind_offset + i
                dF(l0) = dF(l0) + value * F(l0 + ii)
              end do
            end do
          end do
        end do

The ONLY difference between these two versions is the order of the two outermost loops: Version 1 has a loop order of ii, k, j, i, while Version 2 has a loop order of k, ii, j, i. The profiling results for the two versions are summarized below:

             CPU Time (s)   Load Instructions   L1 Cache Hits   L2 Cache Hits   L3 Cache Hits   Main Memory Hits
Version 1    11.282         1.36E+10            75.86%          3.46%           20.69%          0.00%
Version 2     7.372         1.36E+10            94.76%          1.24%            4.00%          0.00%

The results really surprised me in two ways:
(1) I observed a non-trivial speedup of 11.282/7.372 = 1.53 and a significant increase in L1 cache hits (from 75.86% to 94.76%), while the number of load instructions stayed the same.
(2) The only change I made was swapping the order of the two OUTER loops, i.e., the do ii loop and the do k loop.

I have checked the vectorization report and found that the innermost loop (the do i loop) has been vectorized in both versions, so I really have no idea what is going on. I compiled the code using ifort 13.1.0 with -O2 -xHost. The trip count of each loop level is:

do ii loop: 5
do k loop: 48
do j  loop: 40
do i  loop: 36
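
For anyone who wants to try this, here is a minimal standalone sketch of the two loop orders. Only the loop nests and the trip counts come from the code above; the values of N1g and N2 (no padding assumed), the double precision element type, the initial data, the repetition count nrep, and the names dF1, dF2 and loop_order_repro are assumptions made so the example compiles on its own, so the timings it prints will not necessarily match the table.

```fortran
program loop_order_repro
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: i8 = selected_int_kind(18)
  ! Trip counts from the post: ii = 5, k = 48, j = 40, i = 36.
  integer, parameter :: iis = 1, iie = 5
  integer, parameter :: ks = 1, ke = 48
  integer, parameter :: js = 1, je = 40
  integer, parameter :: is = 1, ie = 36
  ! ASSUMPTIONS (not given in the post): no padding, so N1g = 36 and
  ! N2 = 40; double precision data; nrep repetitions for measurable times.
  integer, parameter :: N1g = 36, N2 = 40, nrep = 200
  integer, parameter :: n = N1g * N2 * ke
  real(dp), allocatable :: F(:), dF1(:), dF2(:)
  real(dp) :: vals(iis:iie), value
  integer :: i, j, k, ii, l0, ind_offset, rep
  integer(i8) :: t0, t1, t2, rate

  ! F is read at l0 + ii, so it needs iie extra elements at the end.
  allocate(F(n + iie), dF1(n), dF2(n))
  do i = 1, n + iie
    F(i) = real(mod(i, 7), dp)      ! assumed test data (small integers)
  end do
  do ii = iis, iie
    vals(ii) = real(ii, dp)         ! assumed test data
  end do
  dF1 = 0.0_dp
  dF2 = 0.0_dp

  call system_clock(t0, rate)
  do rep = 1, nrep
    ! Version 1: ii, k, j, i
    do ii = iis, iie
      value = vals(ii)
      do k = ks, ke
        do j = js, je
          ind_offset = ( (k-1)*N2 + (j-1) ) * N1g
          !DIR$ SIMD
          do i = is, ie
            l0 = ind_offset + i
            dF1(l0) = dF1(l0) + value * F(l0 + ii)
          end do
        end do
      end do
    end do
  end do
  call system_clock(t1)
  do rep = 1, nrep
    ! Version 2: k, ii, j, i
    do k = ks, ke
      do ii = iis, iie
        value = vals(ii)
        do j = js, je
          ind_offset = ( (k-1)*N2 + (j-1) ) * N1g
          !DIR$ SIMD
          do i = is, ie
            l0 = ind_offset + i
            dF2(l0) = dF2(l0) + value * F(l0 + ii)
          end do
        end do
      end do
    end do
  end do
  call system_clock(t2)

  print '(a,f8.3,a)', 'Version 1: ', real(t1 - t0, dp) / real(rate, dp), ' s'
  print '(a,f8.3,a)', 'Version 2: ', real(t2 - t1, dp) / real(rate, dp), ' s'
  ! Both versions add the ii contributions to each dF element in the same
  ! order, so the two result arrays should agree.
  print '(a,es10.3)', 'max |dF1 - dF2| = ', maxval(abs(dF1 - dF2))
end program loop_order_repro
```

The last line printed is a sanity check that the two loop orders compute the same result. The `!DIR$ SIMD` line is an Intel directive; other compilers should treat it as an ordinary comment.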

I truly appreciate your time and help.

Best regards,
    Wentao

