I am trying to understand how aligned code is generated, so I created the following code snippet:
subroutine add(A, B, C, N)
    implicit none
    integer, intent(in) :: N
    real*8, intent(in), dimension(N) :: A, B
    real*8, intent(out), dimension(N) :: C
    !dir$ assume_aligned A:32, B:32, C:32
    !dir$ vector aligned
    C = A + B
    return
end subroutine add
which I am compiling with the following two commands:
ifort -align array32byte -S align.f90 -xavx
and
ifort -align array32byte -S align.f90
Since I am aligning the arrays to 32 bytes (for AVX), explicitly adding a directive telling the compiler to assume the arrays are aligned, and even adding a !dir$ vector aligned directive (which shouldn't be necessary given assume_aligned - please correct me if I'm wrong), I would expect the assembly to contain aligned loads and stores. However, I am seeing something interesting:
For the AVX code:
vmovupd (%rdi,%rax,8), %ymm0 #10.5
vmovupd 32(%rdi,%rax,8), %ymm2 #10.5
vmovupd 64(%rdi,%rax,8), %ymm4 #10.5
vmovupd 96(%rdi,%rax,8), %ymm6 #10.5
vaddpd (%rsi,%rax,8), %ymm0, %ymm1 #10.5
vaddpd 32(%rsi,%rax,8), %ymm2, %ymm3 #10.5
vaddpd 64(%rsi,%rax,8), %ymm4, %ymm5 #10.5
vaddpd 96(%rsi,%rax,8), %ymm6, %ymm7 #10.5
vmovupd %ymm1, (%rbx,%rax,8) #10.5
vmovupd %ymm3, 32(%rbx,%rax,8) #10.5
vmovupd %ymm5, 64(%rbx,%rax,8) #10.5
vmovupd %ymm7, 96(%rbx,%rax,8) #10.5
addq $16, %rax #10.5
cmpq %rcx, %rax #10.5
jb ..B1.4 # Prob 82% #10.5
For the SSE2 code:
movaps (%rdi,%rax,8), %xmm0 #10.5
movaps 16(%rdi,%rax,8), %xmm1 #10.5
movaps 32(%rdi,%rax,8), %xmm2 #10.5
movaps 48(%rdi,%rax,8), %xmm3 #10.5
addpd (%r8,%rax,8), %xmm0 #10.5
addpd 16(%r8,%rax,8), %xmm1 #10.5
addpd 32(%r8,%rax,8), %xmm2 #10.5
addpd 48(%r8,%rax,8), %xmm3 #10.5
movaps %xmm0, (%rdx,%rax,8) #10.5
movaps %xmm1, 16(%rdx,%rax,8) #10.5
movaps %xmm2, 32(%rdx,%rax,8) #10.5
movaps %xmm3, 48(%rdx,%rax,8) #10.5
addq $8, %rax #10.5
cmpq %rsi, %rax #10.5
jb ..B1.4 # Prob 82% #10.5
Apparently the AVX version does unaligned packed loads/stores (vmovupd) even though I would expect it to use vmovapd instead. In fact, I am unable to construct an example where I see vmovapd (or vmovaps) at all. The generated code isn't multiversioned either, so it's not that I've overlooked a second loop in the assembly. Also, why does the SSE2 code use aligned loads (which I expect), but the single-precision form of the instruction (movaps rather than movapd)? Does that make any sense?
I have read this excellent article (https://software.intel.com/en-us/articles/fortran-array-data-and-arguments-and-vectorization) and was sure I had the right idea of what the compiler's assembly output would look like, but the results I'm getting show different behavior. Back to the example: I would expect the code to generate movapd/vmovapd instructions for SSE and AVX, respectively. What is my misconception? I am using ifort version 15.0.2.
Thank you!