Hello All,
In the past, I have successfully created Fortran DLLs with OpenMP for use with Excel VBA. Now I would like to integrate some CUDA C GPU code, and I am trying to use the Fortran 2003 C interoperability features to make Intel Fortran talk to CUDA C. I have been able to build an executable that shows the expected behavior. However, when I compile the same code as a DLL and call it from Excel, Excel crashes without warning and gives no diagnostic information whatsoever. If anyone has observed this behavior and found a workaround, I would be glad to get any kind of help. My development configuration and test code are as follows.
Thanks in advance,
Sam V
Build setup: Win 6 x64; Microsoft Excel 2010 VBA; Intel Composer XE 2013 IA-32 with Visual Studio 2008; NVIDIA CUDA C v5.5
Example code:
Fortran code (excelcuda.f90); uncomment/comment the relevant lines to compile it as an executable:
!program main
!implicit none
!real*4::xx(4),yy(4)
!xx=1.D0
!yy=2.D0
!write(*,*) xx, yy
!call myarrtest(xx,yy,4)
!write(*,*) xx, yy
!end program

subroutine myarrtest(arrin,arrout,sz1)
!DEC$ ATTRIBUTES DLLEXPORT,STDCALL,REFERENCE,DECORATE,ALIAS:'myarrtest'::myarrtest
!DEC$ ATTRIBUTES REFERENCE::arrin,arrout,sz1
USE, INTRINSIC :: ISO_C_BINDING
implicit none
INTERFACE
  SUBROUTINE kernel_wrapper (flt_a, flt_b, int_n) BIND(C)
    IMPORT
    INTEGER(C_INT), INTENT(IN) :: int_n
    REAL(C_FLOAT), INTENT(IN) :: flt_a(int_n), flt_b(int_n)
  END SUBROUTINE kernel_wrapper
END INTERFACE
integer*4::i
integer*4,intent(in)::sz1
real*4,dimension(sz1),intent(in)::arrin
real*4,dimension(sz1),intent(out)::arrout
!do i=1,sz1
!arrout(i)=arrin(i)+arrout(i)
!end do
CALL kernel_wrapper(arrout, arrin, sz1)
end subroutine
CUDA C kernel (cudakernel.cu)
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <cuda.h>
#include <cuda_runtime.h>

// simple kernel function that adds two vectors
__global__ void vect_add(float *a, float *b, int N)
{
    int idx = threadIdx.x;
    if (idx<N) a[idx] = a[idx] + b[idx];
}

// function called from main fortran program
extern "C" void kernel_wrapper(float *a, float *b, int *Np)
{
    float *a_d, *b_d;   // declare GPU vector copies
    int blocks = 1;     // uses 1 block of
    int N = *Np;        // N threads on GPU

    // Allocate memory on GPU
    cudaMalloc( (void **)&a_d, sizeof(float) * N );
    cudaMalloc( (void **)&b_d, sizeof(float) * N );

    // copy vectors from CPU to GPU
    cudaMemcpy( a_d, a, sizeof(float) * N, cudaMemcpyHostToDevice );
    cudaMemcpy( b_d, b, sizeof(float) * N, cudaMemcpyHostToDevice );

    // call function on GPU
    vect_add<<< blocks, N >>>( a_d, b_d, N);

    // copy vectors back from GPU to CPU
    cudaMemcpy( a, a_d, sizeof(float) * N, cudaMemcpyDeviceToHost );
    cudaMemcpy( b, b_d, sizeof(float) * N, cudaMemcpyDeviceToHost );

    // free GPU memory
    cudaFree(a_d);
    cudaFree(b_d);
    return;
}
The above code was compiled using the following commands:
nvcc -c -m32 -O3 cudakernel.cu
ifort -dll -libs:dll -iface:stdcall excelcuda.f90 cudakernel.obj cuda.lib cudart.lib
The resulting DLL is called from Excel VBA with the following statements:
Declare Sub myarrtest Lib "excelcuda.dll" (ByRef x As Single, ByRef y As Single, ByRef n As Long)
...
...
Call myarrtest(vbarr(1), fortarr(1), n1)
...
...
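One way to narrow down a crash like this might be to swap the CUDA object for a host-only stand-in that exposes the same C signature as kernel_wrapper in cudakernel.cu. The sketch below is an assumption for illustration, not part of the original build: the file name hostkernel.c is hypothetical, and the function simply repeats on the CPU what vect_add does on the GPU. If Excel still crashes with the stand-in linked into the DLL, the cause is more likely on the Fortran/VBA interface side than in the CUDA runtime.

```c
/* hostkernel.c (hypothetical file name): CPU-only stand-in for kernel_wrapper.
   Same argument list as the CUDA version, but it makes no GPU calls.
   Compiled as C, so no extern "C" is needed. */
#include <stddef.h>

/* Mirrors vect_add: a[i] = a[i] + b[i] for i in [0, N).
   All three arguments arrive by reference, matching the Fortran interface. */
void kernel_wrapper(float *a, float *b, int *Np)
{
    int i;
    int N;

    /* Return early on null pointers instead of dereferencing them. */
    if (a == NULL || b == NULL || Np == NULL)
        return;

    N = *Np;
    for (i = 0; i < N; ++i)
        a[i] = a[i] + b[i];
}
```

Linking this object in place of cudakernel.obj, without cuda.lib and cudart.lib, should leave the Fortran and VBA sides unchanged, so the two builds can be compared directly.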