MAGMA + Fortran(gcc) ZGETRF Performance

tmusho · Post by **tmusho** » Fri Jan 14, 2011 4:14 pm

I am calling magma_zgetrf from gfortran with pinned host memory. I get correct solutions, but the MAGMA call is 3-4 times slower than the scalar LAPACK routine. I am using iso_c_binding to pass pointers from Fortran. Are there any suggestions as to what could be causing the slowdown?

When I run the testing_magma_zgetrf example I get the performance you would expect, with the GPU faster.

I have attached a snippet of my code. It is a little messy from debugging.

Code:

  
  module Gpu_infc

  use, intrinsic :: iso_c_binding

  implicit none

  !! Define GPU Variables
  complex(kind=8), dimension(:,:), pointer :: h_A
  type(C_PTR) :: cptr_h_A
  integer(kind=8) :: d_A !device pointer
  integer(kind=8) :: ldda
  integer(C_SIZE_T), parameter :: sizeof_complex = 16
  integer, parameter :: fp_kind = kind(0.0d0) ! Double precision

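  ! Interface to the CUDA runtime: pinned host allocation, cudaFreeHost, cudaSetDeviceFlags
  interface
  ! Pinned host allocation with flags. The C entry point cudaMallocHost takes only
  ! (ptr, size); the three-argument form with flags is cudaHostAlloc, so bind to that.
  ! The Fortran name is kept so that existing calls still compile.
    integer (C_INT) function cudaMallocHost(buffer, size, flag)  bind(C,name="cudaHostAlloc")
      use iso_c_binding
      implicit none
      type (C_PTR)  :: buffer
      integer (C_SIZE_T), value :: size
      integer (C_INT), value :: flag
    end function cudaMallocHost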
    ! cudaFreeHost
    integer (C_INT) function cudaFreeHost(buffer)  bind(C,name="cudaFreeHost")
      use iso_c_binding
      implicit none
      type (C_PTR), value :: buffer
    end function cudaFreeHost
    ! cudaSetDeviceFlags
    integer (C_INT) function cudaSetDeviceFlags(flag) bind(C,name="cudaSetDeviceFlags")
      use iso_c_binding
      implicit none 
      integer (C_INT), value :: flag
    end function cudaSetDeviceFlags
  end interface

!include 'mpif.h'

contains
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!
! Gpu_infc_init - Allocates gpu variables
!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
subroutine Gpu_infc_init(d_A,size_fp,ldda)

   integer cublas_Get_Error_, cublas_Init_, cublas_alloc_
   external cublas_Get_Error_, cublas_Init_, cublas_alloc_, printout_devices

   integer(kind=8), intent(in) :: size_fp
   integer(kind=8), intent(out) :: d_A
   integer(kind=8), intent(inout) :: ldda
   !complex(kind=8), dimension(:,:), intent(inout) :: h_A

   character*100 :: var
   integer :: stat, Np_t, ldda_t

   if(rank .eq. 0) write(*,*) ' ** Initializing GPU Cards'
   stat = cublas_Init_() !Initialize GPUs
   write(var,*) 'cublas_init - rank',rank
   call Errors_cublas(stat,var)
   if(rank .eq. 0) call printout_devices()
   if(rank .eq. 0) write(*,*) '    Number of GPUs used per node:',nproc

   !! Allocate CPU host memory
   !allocate(h_A(Np*ldda)) !for gpu
   !cudaHostAllocPortable=1, cudaHostAllocMapped= 2
   stat=cudaSetDeviceFlags(8) !8 = cudaDeviceMapHost, enables mapped pinned host memory
   stat = cudaMallocHost(cptr_h_A, Np*Np*size_fp,2)
   write(*,*) 'cudaMallocHost =',stat
   Np_t = Np; ldda_t = ldda !need to be int*4
   call c_f_pointer (cptr_h_A,h_A,(/Np_t,Np_t/))
   write(var,*) 'cublas_alloc - rank',rank !set the label on every rank, not only rank 0
   call Errors_cublas(stat,var)

end subroutine Gpu_infc_init

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!
! Gpu_infc_LU2 - Times LU factorization with MAGMA and LAPACK
!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
subroutine Gpu_infc_LU2(h_A,lda,ldda,ipiv_gpu)

   integer magma_zgetrf, magma_zgetrf_gpu
   real(kind=8) get_current_time, Mpi_Wtime
   external magma_zgetrf, magma_zgetrf_gpu, get_current_time, Mpi_Wtime

   integer(kind=8), intent(in) :: lda, ldda
   integer(kind=4), dimension(:), intent(out) :: ipiv_gpu
   complex(kind=8), dimension(:,:), intent(inout) :: h_A
   integer(kind=8) :: m, n, lda_t

   character*100 :: var
   integer(kind=4) :: stat, info
   real(kind=8) :: end=0, start=0

   m = Np; n = Np; lda_t = lda

   !!Magma call doesn't require external set matrix
   !stat = magma_zgetrf(m, n, h_A , lda_t, ipiv_gpu, info) ! Purge Compute on gpu
   !write(*,*) 'stat1 = ',stat,info
   start = MPI_Wtime()
   stat = magma_zgetrf(m, n, h_A, lda_t, ipiv_gpu, info) ! Compute on gpu
   end = MPI_Wtime()
   write(*,*) 'stat2 = ',stat,info
   end = end - start
   write(*,'(A,F12.6)') 'magma time: ',end
   start = MPI_Wtime()
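   ! Note: h_A was overwritten with the LU factors by magma_zgetrf above,
   ! so LAPACK is timed here on a different matrix than MAGMA was
   call zgetrf_(Np,Np,h_A,lda,ipiv_gpu,info)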
   end = MPI_Wtime()
   end = end - start
   write(*,'(A,F12.6)') 'lapack time: ',end

end subroutine Gpu_infc_LU2

Code:

 Matrix Size: N, N*M =                  1501              2253001
 stat1 =            0           0
 stat2 =            0           0
magma time:     0.111086
 info =            0
lapack time:     0.031947

tmusho · Post by **tmusho** » Thu Jan 20, 2011 3:51 pm

So I think I have solved my own problem. It turns out it isn't a programming problem. The matrix I was trying to decompose was fairly sparse and required little computational effort, so the scalar LAPACK routine could factor it quickly, whereas MAGMA still had to spend time allocating device memory, etc. If I go back and put in a random matrix, as in the benchmark examples, I see that MAGMA is indeed faster.
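
One way to see how a sparse input can make a CPU factorization cheap: some BLAS kernels (the reference implementation's rank-1 update ZGERU, for instance) skip the work for an operand that is exactly zero. The sketch below is not LAPACK's ZGETRF. It is a hypothetical unblocked LU with partial pivoting (`count_lu_updates` is an illustrative name, not a library routine) that counts how many element updates actually run for a dense random matrix versus a tridiagonal one, assuming columns with a zero pivot-row entry are skipped.

```fortran
program sparse_vs_dense_lu
  implicit none
  integer, parameter :: dp = kind(0.0d0)
  integer, parameter :: n = 200
  complex(dp) :: a(n,n)
  real(dp) :: re(n,n), im(n,n)
  integer(kind=8) :: n_dense, n_sparse
  integer :: i

  ! Dense case: uniform random entries in every position
  call random_number(re)
  call random_number(im)
  a = cmplx(re, im, kind=dp)
  n_dense = count_lu_updates(a)

  ! Sparse case: tridiagonal, all other entries exactly zero
  a = (0.0_dp, 0.0_dp)
  do i = 1, n
    a(i,i) = (4.0_dp, 0.0_dp)
    if (i > 1) a(i,i-1) = (1.0_dp, 0.0_dp)
    if (i < n) a(i,i+1) = (1.0_dp, 0.0_dp)
  end do
  n_sparse = count_lu_updates(a)

  print '(A,I12)', 'dense updates:       ', n_dense
  print '(A,I12)', 'tridiagonal updates: ', n_sparse

contains

  ! Hypothetical helper: unblocked LU with partial pivoting, in place.
  ! Returns the number of element updates performed when columns whose
  ! pivot-row entry is exactly zero are skipped.
  function count_lu_updates(a) result(nupd)
    complex(dp), intent(inout) :: a(:,:)
    integer(kind=8) :: nupd
    complex(dp) :: row(size(a,2))
    integer :: k, j, p, m

    m = size(a,1)
    nupd = 0
    do k = 1, m - 1
      ! Partial pivoting: bring the largest |a(i,k)| to row k
      p = k - 1 + maxloc(abs(a(k:m,k)), dim=1)
      if (p /= k) then
        row = a(k,:); a(k,:) = a(p,:); a(p,:) = row
      end if
      if (a(k,k) == (0.0_dp, 0.0_dp)) cycle
      a(k+1:m,k) = a(k+1:m,k) / a(k,k)          ! multipliers
      do j = k + 1, size(a,2)
        if (a(k,j) == (0.0_dp, 0.0_dp)) cycle   ! skip zero pivot-row entry
        a(k+1:m,j) = a(k+1:m,j) - a(k+1:m,k) * a(k,j)
        nupd = nupd + (m - k)
      end do
    end do
  end function count_lu_updates
end program sparse_vs_dense_lu
```

With no zeros to skip, the dense case performs (n-1)n(2n-1)/6 updates, while this tridiagonal matrix needs only n(n-1)/2, so the gap grows with n. Fixed GPU costs such as device allocation and transfers do not shrink in the same way.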

Here is a snippet of the random matrix call and LU calls in Fortran.

Code:

subroutine Gpu_infc_LU2(h_A,lda,ldda,ipiv_gpu)

   integer cublas_set_matrix_, cublas_get_matrix_, cublas_Get_Error_
   integer magma_zgetrf_gpu, magma_zgetrf
   external cublas_set_matrix_, cublas_get_matrix_, cublas_Get_Error_
   external magma_zgetrf_gpu, magma_zgetrf
   real(kind=8) get_current_time, Mpi_Wtime
   external get_current_time, Mpi_Wtime

   integer(kind=8), intent(in) :: lda, ldda
   integer(kind=4), dimension(:), intent(out) :: ipiv_gpu
   complex(kind=8), dimension(:,:), intent(inout) :: h_A
   integer(kind=4) :: m, n, lda_t, i, j

   character*100 :: var
   integer(kind=4) :: stat, info
   real(kind=8) :: end=0, start=0

   complex(kind=8),allocatable,dimension(:,:) :: h_A2
   integer, dimension(4) :: seed
   allocate(h_A2(Np,Np))
   seed(1)=124; seed(2)=352; seed(3)=753; seed(4)=977
   do i = 1, Np
    call zlarnv_(2,seed,Np,h_A(:,i))
   end do
  
   h_A2=h_A

   m = Np; n = Np; lda_t = lda
   write(*,*) 'Matrix Size: N, N*M = ', Np, Np*Np
   !!Magma call doesn't require external set matrix
   !do i=1, 20

   !stat = magma_zgetrf(m, n, h_A , lda_t, ipiv_gpu, info) ! Compute on gpu
   !stat = magma_zgetrf(m, n, cptr_h_A , lda_t, cptr_ipiv, info) ! Compute on gpu
   !write(*,*) 'stat1 = ',stat,info
   start = MPI_Wtime()
   stat = magma_zgetrf(m, n, h_A, lda_t, ipiv_gpu, info) ! Compute on gpu
   !stat = magma_zgetrf(m, n, cptr_h_A , lda_t, cptr_ipiv, info) ! Compute on gpu
   end = MPI_Wtime()
   !write(*,*) 'stat2 = ',stat,info
   end = end - start
   write(*,'(A,F12.6)') 'magma time: ',end
   ! Check MAGMA's info here, before the LAPACK call below overwrites it
   if (info <  0) then
     write (6,*) 'magma: the i-th argument had an illegal value',INFO
     if(info .eq. -7) write(6,*) 'internal GPU memory allocation failed.'
     stop 'ABORT'
   else if(info > 0) then
     write (6,*) 'magma: U(i,i) is exactly zero.',INFO
     !stop 'ABORT'
   end if
   ! Reference timing: LAPACK factors the untouched copy h_A2
   start = MPI_Wtime()
   call zgetrf_(Np,Np,h_A2,lda,ipiv_gpu,info)
   end = MPI_Wtime()
   !write(*,*) 'info = ',info
   end = end - start
   write(*,'(A,F12.6)') 'lapack time: ',end
   do m = 1, Np
    !write(*,*) 'h_A,h_A2 =',h_A(m,m),h_A2(m,m)
   end do
   !end do
   stop
end subroutine Gpu_infc_LU2

Here is the output for a random complex matrix. The GPU is about 45 times faster than the scalar LAPACK routine.

Code:

 Matrix Size: N, N*M =                  1920              3686400
magma time:     0.229897
lapack time:    10.328367
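
As a rough check on these numbers, the two timings can be converted into a speedup and an approximate flop rate. The sketch below assumes the usual leading-order operation count for a complex LU factorization, (8/3)N^3 real flops (N^3/3 complex multiplies at 6 flops each plus N^3/3 complex adds at 2 flops each). That count is an assumption of this example, not something the program above reports.

```fortran
program lu_rate
  implicit none
  integer, parameter :: dp = kind(0.0d0)
  integer, parameter :: n = 1920                   ! matrix size from the output above
  real(dp), parameter :: t_magma  = 0.229897_dp    ! seconds, from the output above
  real(dp), parameter :: t_lapack = 10.328367_dp   ! seconds, from the output above
  real(dp) :: flops

  ! Assumed leading-order flop count for complex LU: (8/3) N^3
  flops = 8.0_dp / 3.0_dp * real(n, dp)**3

  print '(A,F8.1)', 'speedup:        ', t_lapack / t_magma
  print '(A,F8.1)', 'MAGMA  GFLOP/s: ', flops / t_magma  / 1.0e9_dp
  print '(A,F8.1)', 'LAPACK GFLOP/s: ', flops / t_lapack / 1.0e9_dp
end program lu_rate
```

Under that assumption the run works out to a speedup of roughly 45, with about 82 GFLOP/s for MAGMA and about 1.8 GFLOP/s for the scalar LAPACK build.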

Stan Tomov · Post by **Stan Tomov** » Fri Jan 21, 2011 2:54 am

We are going to post MAGMA 1.0 RC3 tomorrow. It will include an example of how to call MAGMA from Fortran. Part of the release will be a timing function that ensures previously started GPU computations have finished. Using this timer, we find that MAGMA's performance is the same through the C and Fortran testers.
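
To illustrate why such a timer matters: GPU work can be launched asynchronously, so a host clock read right after a call may not include all of the GPU's work. The following is a minimal sketch of a synchronizing timer in Fortran. It is not MAGMA's timing routine. `synced_wtime` and `synced_timer` are hypothetical names, and the example assumes the CUDA runtime library is linked (e.g. `-lcudart`) and provides `cudaDeviceSynchronize`.

```fortran
module synced_timer
  use, intrinsic :: iso_c_binding
  implicit none

  interface
    ! CUDA runtime: blocks until all previously issued device work is done
    function cudaDeviceSynchronize() bind(C, name="cudaDeviceSynchronize") result(ierr)
      import :: c_int
      integer(c_int) :: ierr
    end function cudaDeviceSynchronize
  end interface

contains

  ! Hypothetical helper: wait for the GPU, then read the wall clock (seconds)
  function synced_wtime() result(t)
    real(kind=8) :: t
    integer(kind=8) :: ticks, rate
    integer(c_int) :: ierr

    ierr = cudaDeviceSynchronize()
    if (ierr /= 0) write(*,*) 'cudaDeviceSynchronize returned', ierr
    call system_clock(ticks, rate)
    t = real(ticks, kind=8) / real(rate, kind=8)
  end function synced_wtime

end module synced_timer

program demo
  use synced_timer
  implicit none
  real(kind=8) :: t0, t1

  t0 = synced_wtime()
  ! ... the GPU call being timed would go here ...
  t1 = synced_wtime()
  write(*,'(A,F12.6)') 'elapsed: ', t1 - t0
end program demo
```

Reading the clock only after the synchronization means the interval covers everything the GPU was asked to do, which matters most for routines that return before the device has finished.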
