The routine I tested is magmaf_cgetrf_gpu. There is no such issue with the host routine: with it, both machines reach the correct result.
The result on cuda1, which is correct:
Code: Select all
Process 1 of 2 took GPU: 1
Process 0 of 2 took GPU: 0
runtime: 8652.217 ms
runtime: 8652.270 ms
Solving A x = b using LU factorization:
|| A || = 5.228E+03
|| b || = 9.999E-01
|| x || = 4.428E+00
|| b - A x || = 1.000E+00
Gflops = 82.73448000285711
Solution is CORRECT
Solving A x = b using LU factorization:
|| A || = 5.228E+03
|| b || = 9.999E-01
|| x || = 4.428E+00
|| b - A x || = 1.000E+00
Gflops = 82.76268897634245
Solution is CORRECT
The result on cuda2, which is suspicious:
Code: Select all
Process 0 of 2 took GPU: 0
Process 1 of 2 took GPU: 1
runtime: 10409.17 ms
Info : 8
runtime: 10409.17 ms
Solving A x = b using LU factorization:
|| A || = 5.228E+03
|| b || = 9.999E-01
|| x || = 9.999E-01
|| b - A x || = 2.650E+03
Gflops = 139.3943775337270
Solution is suspicious, 8.304E+02
Solving A x = b using LU factorization:
|| A || = 5.228E+03
|| b || = 9.999E-01
|| x || = 9.999E-01
|| b - A x || = 2.650E+03
Gflops = 68.76998791355494
Solution is suspicious, 8.304E+02
More information about the devices is below:
Device Info
cuda1 (correct)
Code: Select all
running on: n48
gives a report of the devices that were found
./deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
There are 4 devices supporting CUDA
Device 0: "Tesla T10 Processor"
CUDA Driver Version: 3.20
CUDA Runtime Version: 3.20
CUDA Capability Major/Minor version number: 1.3
Total amount of global memory: 4294770688 bytes
Multiprocessors x Cores/MP = Cores: 30 (MP) x 8 (Cores/MP) = 240 (Cores)
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 16384
Warp size: 32
Maximum number of threads per block: 512
Maximum sizes of each dimension of a block: 512 x 512 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 2147483647 bytes
Texture alignment: 256 bytes
Clock rate: 1.44 GHz
Concurrent copy and execution: Yes
Run time limit on kernels: No
Integrated: No
Support host page-locked memory mapping: Yes
Compute mode: Default (multiple host threads can use this device simultaneously)
Concurrent kernel execution: No
Device has ECC support enabled: No
Device is using TCC driver mode: No
Device 1: "Tesla T10 Processor"
......(same as Device 0)
Device 2: "Tesla T10 Processor"
......(same as Device 0)
Device 3: "Tesla T10 Processor"
......(same as Device 0)
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 3.20, CUDA Runtime Version = 3.20, NumDevs = 4, Device = Tesla T10 Processor, Device = Tesla T10 Processor
PASSED
Press <Enter> to Quit...
cuda2 (suspicious)
Code: Select all
running on: n53
gives a report of the devices that were found
./deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
There are 3 devices supporting CUDA
Device 0: "Tesla M2070"
CUDA Driver Version: 3.20
CUDA Runtime Version: 3.20
CUDA Capability Major/Minor version number: 2.0
Total amount of global memory: 5636554752 bytes
Multiprocessors x Cores/MP = Cores: 14 (MP) x 32 (Cores/MP) = 448 (Cores)
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 32768
Warp size: 32
Maximum number of threads per block: 1024
Maximum sizes of each dimension of a block: 1024 x 1024 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Clock rate: 1.15 GHz
Concurrent copy and execution: Yes
Run time limit on kernels: No
Integrated: No
Support host page-locked memory mapping: Yes
Compute mode: Default (multiple host threads can use this device simultaneously)
Concurrent kernel execution: Yes
Device has ECC support enabled: Yes
Device is using TCC driver mode: No
Device 1: "Tesla M2070"
......(same as Device 0)
Device 2: "Tesla M2070"
......(same as Device 0)
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 3.20, CUDA Runtime Version = 3.20, NumDevs = 3, Device = Tesla M2070, Device = Tesla M2070
PASSED
http://www.google.com/url?sa=t&source=w ... ZOjJVti5hQ
The hardware model is the HP ProLiant SL390s G7, a 2U half-width server.
I have now updated the CUDA drivers from 3.2 to 4.0; the same problem still occurs.
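For reference, the "suspicious" flag in the test program comes from the scaled residual || b - A x || / ((||A|| ||x|| + ||b||) * n * eps), compared against a threshold of 60. Plugging in the norms printed by the failing run reproduces the reported value (a quick sketch in Python; 2^-24 is the single-precision machine epsilon, matching slamch('E')):

```python
# Scaled residual check mirroring the Fortran test:
#   Rnorm / ((Anorm*Xnorm + Bnorm) * n * eps) > 60  =>  "suspicious"
EPS = 2.0 ** -24  # single-precision epsilon, as returned by slamch('E')

def scaled_residual(rnorm, anorm, xnorm, bnorm, n):
    """Forward-error estimate used by the acceptance test above."""
    return rnorm / ((anorm * xnorm + bnorm) * n * EPS)

# Norms printed by the suspicious run on cuda2 (n = 10240):
r = scaled_residual(rnorm=2.650e3, anorm=5.228e3,
                    xnorm=9.999e-1, bnorm=9.999e-1, n=10240)
print(round(r, 1))  # ~8.304E+02, the value the test reports as suspicious
```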
Test Code:
test_magma.f
Code: Select all
!
! -- MAGMA (version 1.0) --
! Univ. of Tennessee, Knoxville
! Univ. of California, Berkeley
! Univ. of Colorado, Denver
! November 2010
!
! @generated c
!
program testing_cgetrf_gpu_f
use magma
use more_mpi
integer :: rc, mydev, numdev
external cublas_init, cublas_set_matrix, cublas_get_matrix
external cublas_shutdown, cublas_alloc
external clange, cgemm, cgesv, slamch
real clange, slamch
integer cublas_alloc
real :: rnumber(2), Anorm, Bnorm, Rnorm, Xnorm
real, allocatable :: work(:)
complex, allocatable :: h_A(:), h_B(:), h_X(:)
magma_devptr_t :: devptrA, devptrB
integer, allocatable :: ipiv(:)
complex :: zone, mzone
integer :: i, n, info, stat, lda
integer :: size_of_elt, nrhs
real(kind=8) :: flops, t
integer :: tstart(2), tend(2)
PARAMETER ( nrhs = 1, zone = 1., mzone = -1. )
integer c1,c2,cr,cm
call init
ierr = cudaGetDeviceCount(numdev)
mydev = 0 !mod(cpuid,numdev)
print *, "Process ", cpuid, " of ", numprocs, " took GPU: ", mydev
ierr = cudaSetDevice(mydev)
call cublas_init()
n = 10240 ! 2048
lda = n
ldda = ((n+31)/32)*32
size_of_elt = sizeof_complex
!------ Allocate CPU memory
allocate(h_A(lda*n))
allocate(h_B(lda*nrhs))
allocate(h_X(lda*nrhs))
allocate(work(n))
allocate(ipiv(n))
!------ Allocate GPU memory
stat = cublas_alloc(ldda*n, size_of_elt, devPtrA)
if (stat .ne. 0) then
write(*,*) "device memory allocation failed"
stop
endif
stat = cublas_alloc(ldda*nrhs, size_of_elt, devPtrB)
if (stat .ne. 0) then
write(*,*) "device memory allocation failed"
stop
endif
!---- Initialize the matrix
do i=1,lda*n
call random_number(rnumber)
h_A(i) = rnumber(1)
end do
do i=1,lda*nrhs
call random_number(rnumber)
h_B(i) = rnumber(1)
end do
h_X(:) = h_B(:)
!---- devPtrA = h_A
call cublas_set_matrix(n, n, size_of_elt, h_A, lda, devptrA, ldda)
!---- devPtrB = h_B
call cublas_set_matrix(n, nrhs, size_of_elt, h_B, lda, devptrB, ldda)
call MPI_BARRIER(mpi_comm_world,mpi_err)
call system_clock( c1, cr, cm )
!---- Call magma LU ----------------
call magma_gettime_f(tstart)
call magmaf_cgetrf_gpu(n, n, devptrA, ldda, ipiv, info)
call magma_gettime_f(tend)
call MPI_BARRIER(mpi_comm_world,mpi_err)
call system_clock( count=c2 )
print *, ' runtime:', 1.e3*real(c2-c1) / real(cr), 'ms'
if ( info .ne. 0 ) then
write(*,*) "Info : ", info
end if
!---- Call magma solve -------------
call magmaf_cgetrs_gpu('n', n, nrhs, devptrA, ldda, ipiv, devptrB, ldda, info)
if ( info .ne. 0 ) then
write(*,*) "Info : ", info
end if
!---- h_X = devptrB
call cublas_get_matrix (n, nrhs, size_of_elt, devptrB, ldda, h_X, lda)
!---- Compare the two results ------
Anorm = clange('I', n, n, h_A, lda, work)
Bnorm = clange('I', n, nrhs, h_B, lda, work)
Xnorm = clange('I', n, nrhs, h_X, lda, work)
call cgemm('n', 'n', n, nrhs, n, zone, h_A, lda, h_X, lda, mzone, h_B, lda)
Rnorm = clange('I', n, nrhs, h_B, lda, work)
write(*,*)
write(*,* ) 'Solving A x = b using LU factorization:'
write(*,105) ' || A || = ', Anorm
write(*,105) ' || b || = ', Bnorm
write(*,105) ' || x || = ', Xnorm
write(*,105) ' || b - A x || = ', Rnorm
flops = 2. * n * n * n / 3.
call magma_gettimervalue_f(tstart, tend, t)
write(*,*) ' Gflops = ', flops / t / 1e6
write(*,*)
Rnorm = Rnorm / ( (Anorm*Xnorm+Bnorm) * n * slamch('E') )
if ( Rnorm > 60. ) then
write(*,105) ' Solution is suspicious, ', Rnorm
else
write(*,105) ' Solution is CORRECT'
end if
!---- Free CPU memory
deallocate(h_A, h_X, h_B, work, ipiv)
!---- Free GPU memory
call cublas_free(devPtrA)
call cublas_free(devPtrB)
call cublas_shutdown()
105 format((a35,es10.3))
call MPI_FINALIZE(rc)
end
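One detail worth noting in the listing: ldda = ((n+31)/32)*32 pads the leading dimension of the device matrix up to the next multiple of 32, which keeps GPU accesses aligned. A trivial sketch of what that integer arithmetic does:

```python
def pad32(n):
    """Round n up to the next multiple of 32 (the ldda computation above)."""
    return ((n + 31) // 32) * 32

print(pad32(10240))  # 10240: already a multiple of 32, so ldda == n here
print(pad32(1000))   # 1024: padded up to the next multiple of 32
```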
Code: Select all
module more_mpi
include 'mpif.h'
integer :: ierr,cpuid,numprocs,namelen !mpi
character(len=100) processor_name
end module
Code: Select all
subroutine init
use more_mpi
call mpi_init(ierr)
call mpi_comm_rank(mpi_comm_world,cpuid,ierr)
call mpi_comm_size(mpi_comm_world,numprocs,ierr)
call mpi_get_processor_name(processor_name,namelen,ierr)
end subroutine init
Code: Select all
.SUFFIXES: .cuf .o
L1= test_magma.o more_mpi.o init.o fortran.o
CULAINCLUDES= -I${CULA_INC_PATH}
CULALIBPATH64= -L${CULA_LIB_PATH_64}
CUDAINCLUDES= -I${CUDA_INC_PATH}
CUDALIBPATH64= -L${CUDA_LIB_PATH_64}
CUDALIBS= -lcudart -lcuda -lcublas
MAGMALIBPATH=-L/u/af/bj/cding/MAGMA/lib
MAGMALIBS= -lmagma -lmagmablas
GPULIBS= -lcula_pgfortran #-lcula -lcublas -lcudart
PGFLAGS= -Mfree -O3 -DADD_
#CUDA= -ta=nvidia -Mcuda
CUDA=
SOPT=
LINK1= /opt/intel/Compiler/11.1/069/mkl/lib/em64t/libmkl_scalapack_lp64.a \
/opt/intel/Compiler/11.1/069/mkl/lib/em64t/libmkl_intel_lp64.a \
/opt/intel/Compiler/11.1/069/mkl/lib/em64t/libmkl_blacs_openmpi_lp64.a \
/opt/intel/Compiler/11.1/069/mkl/lib/em64t/libmkl_core.a \
/opt/intel/Compiler/11.1/069/mkl/lib/em64t/libmkl_sequential.a \
/opt/intel/Compiler/11.1/069/mkl/lib/em64t/libmkl_core.a \
/opt/intel/Compiler/11.1/069/mkl/lib/em64t/libmkl_sequential.a \
/opt/intel/Compiler/11.1/069/mkl/lib/em64t/libmkl_core.a \
-lpthread
#LINK_CU= /opt/pgi/linux86-64/10.6/lib/libcudafor.a
LINK_CU= /opt/pgi/linux86-64/11.5/lib/libcudafor.a
PF90= mpif90
PGFOR= pgfortran
PGA_EX= magma_test_mgpus
main: $(L1)
$(PF90) $(SOPT) $(PGFLAGS) $(L1) $(CUDAINCLUDES) $(CUDALIBPATH64) $(CUDALIBS) $(CULAINCLUDES) $(CULALIBPATH64) $(GPULIBS) $(CUDAINCLUDES) $(CUDALIBPATH64) $(CUDALIBS) $(MAGMALIBPATH) $(MAGMALIBS) $(LINK1) $(LINK_CU) -o $(PGA_EX)
.f.o:
$(PF90) $(SOPT) $(PGFLAGS) -Mpreprocess -Dmagma_devptr_t="integer(kind=8)" -I/opt/development/gpu/current/cuda/include -I/u/af/bj/cding/download/magma_1.0.0-rc5/include -I/u/af/bj/cding/download/magma_1.0.0-rc5/include $(CUDAINCLUDES) $(CUDALIBPATH64) $(CUDALIBS) -c $<
.cuf.o:
$(PGFOR) $(SOPT) $(PGFLAGS) $(CUDA) $(CULAINCLUDES) $(CULALIBPATH64) $(GPULIBS) -I/opt/development/gpu/current/cuda/include -I/u/af/bj/cding/download/magma_1.0.0-rc5/include -I/u/af/bj/cding/download/magma_1.0.0-rc5/include $(CUDAINCLUDES) $(CUDALIBPATH64) $(CUDALIBS) -c $<
.c.o:
gcc -O3 -DCUBLAS_USE_THUNKING $(CUDAINCLUDES) -c $<
test_magma.o: test_magma.f init.o more_mpi.o
more_mpi.o: more_mpi.f
init.o: init.f more_mpi.o
fortran.o: fortran.c
clean:
/bin/rm -f *o *mod $(L1b) $(L2b) $(PGA_EX)
del:
rm -f *.mio.mines.edu
Code: Select all
mpiexec -np 2 magma_test_mgpus
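Note that in the posted source, mydev is hard-coded to 0 and the mod(cpuid,numdev) expression is commented out; the "took GPU: 0/1" lines in the logs above suggest the round-robin mapping was active when they were captured. For illustration, that commented-out expression assigns ranks to devices like this:

```python
def pick_device(rank, numdev):
    """Round-robin MPI-rank-to-GPU mapping, i.e. mod(cpuid, numdev)."""
    return rank % numdev

# Two ranks on the 4-GPU node (cuda1) land on distinct devices:
print([pick_device(r, 4) for r in (0, 1)])  # [0, 1]
```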