Same test code on different devices, one correct, one wrong
Posted: Wed Aug 31, 2011 5:52 pm
I tested the performance of MAGMA routines in a multi-CPU, multi-GPU code, but the results from two machines are completely different: one gives a correct result, the other a suspicious one. How can this happen? Could it be a device problem or a bug in the code?
For the suspicious result, x is identical to b: magmaf_cgetrf_gpu just copies b instead of solving the system. Does anyone know why?
More detail about cuda2 (suspicious) can be found here: http://h18004.www1.hp.com/products/quic ... _00205.pdf or
http://www.google.com/url?sa=t&source=w ... ZOjJVti5hQ
The hardware model is an HP ProLiant SL390s G7 2U half-width server.
I have since updated the CUDA driver from 3.2 to 4.0; the same problem still occurs.
The test code follows below: test_magma.f, more_mpi.f, init.f, the makefile, and the execute command.
The routine I tested is magmaf_cgetrf_gpu. The host routine has no such issue; both machines reach the correct result.
The result on cuda1, which gives the correct result:
Code: Select all
Process 1 of 2 took GPU: 1
Process 0 of 2 took GPU: 0
runtime: 8652.217 ms
runtime: 8652.270 ms
Solving A x = b using LU factorization:
|| A || = 5.228E+03
|| b || = 9.999E-01
|| x || = 4.428E+00
|| b - A x || = 1.000E+00
Gflops = 82.73448000285711
Solution is CORRECT
Solving A x = b using LU factorization:
|| A || = 5.228E+03
|| b || = 9.999E-01
|| x || = 4.428E+00
|| b - A x || = 1.000E+00
Gflops = 82.76268897634245
Solution is CORRECT
The result on cuda2, which gives the suspicious result:
Code: Select all
Process 0 of 2 took GPU: 0
Process 1 of 2 took GPU: 1
runtime: 10409.17 ms
Info : 8
runtime: 10409.17 ms
Solving A x = b using LU factorization:
|| A || = 5.228E+03
|| b || = 9.999E-01
|| x || = 9.999E-01
|| b - A x || = 2.650E+03
Gflops = 139.3943775337270
Solution is suspicious, 8.304E+02
Solving A x = b using LU factorization:
|| A || = 5.228E+03
|| b || = 9.999E-01
|| x || = 9.999E-01
|| b - A x || = 2.650E+03
Gflops = 68.76998791355494
Solution is suspicious, 8.304E+02
More information about the devices is below:
Device Info
cuda1 (correct)
Code: Select all
running on: n48
gives a report of the devices that were found
./deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
There are 4 devices supporting CUDA
Device 0: "Tesla T10 Processor"
CUDA Driver Version: 3.20
CUDA Runtime Version: 3.20
CUDA Capability Major/Minor version number: 1.3
Total amount of global memory: 4294770688 bytes
Multiprocessors x Cores/MP = Cores: 30 (MP) x 8 (Cores/MP) = 240 (Cores)
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 16384
Warp size: 32
Maximum number of threads per block: 512
Maximum sizes of each dimension of a block: 512 x 512 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 2147483647 bytes
Texture alignment: 256 bytes
Clock rate: 1.44 GHz
Concurrent copy and execution: Yes
Run time limit on kernels: No
Integrated: No
Support host page-locked memory mapping: Yes
Compute mode: Default (multiple host threads can use this device simultaneously)
Concurrent kernel execution: No
Device has ECC support enabled: No
Device is using TCC driver mode: No
Device 1: "Tesla T10 Processor"
...... (same as Device 0)
Device 2: "Tesla T10 Processor"
...... (same as Device 0)
Device 3: "Tesla T10 Processor"
...... (same as Device 0)
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 3.20, CUDA Runtime Version = 3.20, NumDevs = 4, Device = Tesla T10 Processor, Device = Tesla T10 Processor
PASSED
Press <Enter> to Quit...
cuda2 (suspicious)
Code: Select all
running on: n53
gives a report of the devices that were found
./deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
There are 3 devices supporting CUDA
Device 0: "Tesla M2070"
CUDA Driver Version: 3.20
CUDA Runtime Version: 3.20
CUDA Capability Major/Minor version number: 2.0
Total amount of global memory: 5636554752 bytes
Multiprocessors x Cores/MP = Cores: 14 (MP) x 32 (Cores/MP) = 448 (Cores)
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 32768
Warp size: 32
Maximum number of threads per block: 1024
Maximum sizes of each dimension of a block: 1024 x 1024 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Clock rate: 1.15 GHz
Concurrent copy and execution: Yes
Run time limit on kernels: No
Integrated: No
Support host page-locked memory mapping: Yes
Compute mode: Default (multiple host threads can use this device simultaneously)
Concurrent kernel execution: Yes
Device has ECC support enabled: Yes
Device is using TCC driver mode: No
Device 1: "Tesla M2070"
...... (same as Device 0)
Device 2: "Tesla M2070"
...... (same as Device 0)
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 3.20, CUDA Runtime Version = 3.20, NumDevs = 3, Device = Tesla M2070, Device = Tesla M2070
PASSED
Test Code:
test_magma.f
Code: Select all
!
! -- MAGMA (version 1.0) --
! Univ. of Tennessee, Knoxville
! Univ. of California, Berkeley
! Univ. of Colorado, Denver
! November 2010
!
! @generated c
!
program testing_cgetrf_gpu_f
use magma
use more_mpi
integer :: rc, mydev, numdev
external cublas_init, cublas_set_matrix, cublas_get_matrix
external cublas_shutdown, cublas_alloc
external clange, cgemm, cgesv, slamch
real clange, slamch
integer cublas_alloc
real :: rnumber(2), Anorm, Bnorm, Rnorm, Xnorm
real, allocatable :: work(:)
complex, allocatable :: h_A(:), h_B(:), h_X(:)
magma_devptr_t :: devptrA, devptrB
integer, allocatable :: ipiv(:)
complex :: zone, mzone
integer :: i, n, info, stat, lda
integer :: size_of_elt, nrhs
real(kind=8) :: flops, t
integer :: tstart(2), tend(2)
PARAMETER ( nrhs = 1, zone = 1., mzone = -1. )
integer c1,c2,cr,cm
call init
ierr = cudaGetDeviceCount(numdev)
mydev = 0 !mod(cpuid,numdev)
print *, "Process ", cpuid, " of ", numprocs, " took GPU: ", mydev
ierr = cudaSetDevice(mydev)
call cublas_init()
n = 10240 ! 2048
lda = n
ldda = ((n+31)/32)*32 ! round the leading dimension up to a multiple of 32
size_of_elt = sizeof_complex
!------ Allocate CPU memory
allocate(h_A(lda*n))
allocate(h_B(lda*nrhs))
allocate(h_X(lda*nrhs))
allocate(work(n))
allocate(ipiv(n))
!------ Allocate GPU memory
stat = cublas_alloc(ldda*n, size_of_elt, devPtrA)
if (stat .ne. 0) then
write(*,*) "device memory allocation failed"
stop
endif
stat = cublas_alloc(ldda*nrhs, size_of_elt, devPtrB)
if (stat .ne. 0) then
write(*,*) "device memory allocation failed"
stop
endif
!---- Initialize the matrix
do i=1,lda*n
call random_number(rnumber)
h_A(i) = rnumber(1)
end do
do i=1,lda*nrhs
call random_number(rnumber)
h_B(i) = rnumber(1)
end do
h_X(:) = h_B(:)
!---- devPtrA = h_A
call cublas_set_matrix(n, n, size_of_elt, h_A, lda, devptrA, ldda)
!---- devPtrB = h_B
call cublas_set_matrix(n, nrhs, size_of_elt, h_B, lda, devptrB, ldda)
call MPI_BARRIER(mpi_comm_world,mpi_err)
call system_clock( c1, cr, cm )
!---- Call magma LU ----------------
call magma_gettime_f(tstart)
call magmaf_cgetrf_gpu(n, n, devptrA, ldda, ipiv, info)
call magma_gettime_f(tend)
call MPI_BARRIER(mpi_comm_world,mpi_err)
call system_clock( count=c2 )
print *, ' runtime:', 1.e3*real(c2-c1) / real(cr), 'ms'
if ( info .ne. 0 ) then
! LAPACK convention: info = k > 0 means pivot U(k,k) is exactly zero
write(*,*) "Info : ", info
end if
!---- Call magma solve -------------
call magmaf_cgetrs_gpu('n', n, nrhs, devptrA, ldda, ipiv, devptrB, ldda, info)
if ( info .ne. 0 ) then
write(*,*) "Info : ", info
end if
!---- h_X = devptrB
call cublas_get_matrix (n, nrhs, size_of_elt, devptrB, ldda, h_X, lda)
!---- Compare the two results ------
Anorm = clange('I', n, n, h_A, lda, work)
Bnorm = clange('I', n, nrhs, h_B, lda, work)
Xnorm = clange('I', n, nrhs, h_X, lda, work)
call cgemm('n', 'n', n, nrhs, n, zone, h_A, lda, h_X, lda, mzone, h_B, lda)
Rnorm = clange('I', n, nrhs, h_B, lda, work)
write(*,*)
write(*,* ) 'Solving A x = b using LU factorization:'
write(*,105) ' || A || = ', Anorm
write(*,105) ' || b || = ', Bnorm
write(*,105) ' || x || = ', Xnorm
write(*,105) ' || b - A x || = ', Rnorm
flops = 2. * n * n * n / 3.
call magma_gettimervalue_f(tstart, tend, t)
write(*,*) ' Gflops = ', flops / t / 1e6
write(*,*)
Rnorm = Rnorm / ( (Anorm*Xnorm+Bnorm) * n * slamch('E') )
if ( Rnorm > 60. ) then
write(*,105) ' Solution is suspicious, ', Rnorm
else
write(*,105) ' Solution is CORRECT'
end if
!---- Free CPU memory
deallocate(h_A, h_X, h_B, work, ipiv)
!---- Free GPU memory
call cublas_free(devPtrA)
call cublas_free(devPtrB)
call cublas_shutdown()
105 format((a35,es10.3))
call MPI_FINALIZE(rc)
end
more_mpi.f
Code: Select all
module more_mpi
include 'mpif.h'
integer :: ierr,cpuid,numprocs,namelen !mpi
character(len=100) processor_name
end module
init.f
Code: Select all
subroutine init
use more_mpi
call mpi_init(ierr)
call mpi_comm_rank(mpi_comm_world,cpuid,ierr)
call mpi_comm_size(mpi_comm_world,numprocs,ierr)
call mpi_get_processor_name(processor_name,namelen,ierr)
end subroutine init
makefile
Code: Select all
.SUFFIXES: .cuf .o
L1= test_magma.o more_mpi.o init.o fortran.o
CULAINCLUDES= -I${CULA_INC_PATH}
CULALIBPATH64= -L${CULA_LIB_PATH_64}
CUDAINCLUDES= -I${CUDA_INC_PATH}
CUDALIBPATH64= -L${CUDA_LIB_PATH_64}
CUDALIBS= -lcudart -lcuda -lcublas
MAGMALIBPATH=-L/u/af/bj/cding/MAGMA/lib
MAGMALIBS= -lmagma -lmagmablas
GPULIBS= -lcula_pgfortran #-lcula -lcublas -lcudart
PGFLAGS= -Mfree -O3 -DADD_
#CUDA= -ta=nvidia -Mcuda
CUDA=
SOPT=
LINK1= /opt/intel/Compiler/11.1/069/mkl/lib/em64t/libmkl_scalapack_lp64.a \
/opt/intel/Compiler/11.1/069/mkl/lib/em64t/libmkl_intel_lp64.a \
/opt/intel/Compiler/11.1/069/mkl/lib/em64t/libmkl_blacs_openmpi_lp64.a \
/opt/intel/Compiler/11.1/069/mkl/lib/em64t/libmkl_core.a \
/opt/intel/Compiler/11.1/069/mkl/lib/em64t/libmkl_sequential.a \
/opt/intel/Compiler/11.1/069/mkl/lib/em64t/libmkl_core.a \
/opt/intel/Compiler/11.1/069/mkl/lib/em64t/libmkl_sequential.a \
/opt/intel/Compiler/11.1/069/mkl/lib/em64t/libmkl_core.a \
-lpthread
#LINK_CU= /opt/pgi/linux86-64/10.6/lib/libcudafor.a
LINK_CU= /opt/pgi/linux86-64/11.5/lib/libcudafor.a
PF90= mpif90
PGFOR= pgfortran
PGA_EX= magma_test_mgpus
main: $(L1)
$(PF90) $(SOPT) $(PGFLAGS) $(L1) $(CUDAINCLUDES) $(CUDALIBPATH64) $(CUDALIBS) $(CULAINCLUDES) $(CULALIBPATH64) $(GPULIBS) $(CUDAINCLUDES) $(CUDALIBPATH64) $(CUDALIBS) $(MAGMALIBPATH) $(MAGMALIBS) $(LINK1) $(LINK_CU) -o $(PGA_EX)
.f.o:
$(PF90) $(SOPT) $(PGFLAGS) -Mpreprocess -Dmagma_devptr_t="integer(kind=8)" -I/opt/development/gpu/current/cuda/include -I/u/af/bj/cding/download/magma_1.0.0-rc5/include -I/u/af/bj/cding/download/magma_1.0.0-rc5/include $(CUDAINCLUDES) $(CUDALIBPATH64) $(CUDALIBS) -c $<
.cuf.o:
$(PGFOR) $(SOPT) $(PGFLAGS) $(CUDA) $(CULAINCLUDES) $(CULALIBPATH64) $(GPULIBS) -I/opt/development/gpu/current/cuda/include -I/u/af/bj/cding/download/magma_1.0.0-rc5/include -I/u/af/bj/cding/download/magma_1.0.0-rc5/include $(CUDAINCLUDES) $(CUDALIBPATH64) $(CUDALIBS) -c $<
.c.o:
gcc -O3 -DCUBLAS_USE_THUNKING $(CUDAINCLUDES) -c $<
test_magma.o: test_magma.f init.o more_mpi.o
more_mpi.o: more_mpi.f
init.o: init.f more_mpi.o
fortran.o: fortran.c
clean:
/bin/rm -f *o *mod $(L1b) $(L2b) $(PGA_EX)
del:
rm -f *.mio.mines.edu
execute command:
Code: Select all
mpiexec -np 2 magma_test_mgpus