The routine I tested is magmaf_cgetrf_gpu. There is no such issue with the host routine: with it, both machines reach the correct result.
The result on cuda1, which is correct:
Code: Select all
Process 1 of 2 took GPU: 1
Process 0 of 2 took GPU: 0
runtime: 8652.217 ms
runtime: 8652.270 ms
Solving A x = b using LU factorization:
|| A || = 5.228E+03
|| b || = 9.999E-01
|| x || = 4.428E+00
|| b - A x || = 1.000E+00
Gflops = 82.73448000285711
Solution is CORRECT
Solving A x = b using LU factorization:
|| A || = 5.228E+03
|| b || = 9.999E-01
|| x || = 4.428E+00
|| b - A x || = 1.000E+00
Gflops = 82.76268897634245
Solution is CORRECT
The result on cuda2, which is suspicious:
Code: Select all
Process 0 of 2 took GPU: 0
Process 1 of 2 took GPU: 1
runtime: 10409.17 ms
Info : 8
runtime: 10409.17 ms
Solving A x = b using LU factorization:
|| A || = 5.228E+03
|| b || = 9.999E-01
|| x || = 9.999E-01
|| b - A x || = 2.650E+03
Gflops = 139.3943775337270
Solution is suspicious, 8.304E+02
Solving A x = b using LU factorization:
|| A || = 5.228E+03
|| b || = 9.999E-01
|| x || = 9.999E-01
|| b - A x || = 2.650E+03
Gflops = 68.76998791355494
Solution is suspicious, 8.304E+02
More information about the devices is below:
Device Info
cuda1 (correct)
Code: Select all
running on: n48
gives a report of the devices that were found
./deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
There are 4 devices supporting CUDA
Device 0: "Tesla T10 Processor"
CUDA Driver Version: 3.20
CUDA Runtime Version: 3.20
CUDA Capability Major/Minor version number: 1.3
Total amount of global memory: 4294770688 bytes
Multiprocessors x Cores/MP = Cores: 30 (MP) x 8 (Cores/MP) = 240 (Cores)
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 16384
Warp size: 32
Maximum number of threads per block: 512
Maximum sizes of each dimension of a block: 512 x 512 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 2147483647 bytes
Texture alignment: 256 bytes
Clock rate: 1.44 GHz
Concurrent copy and execution: Yes
Run time limit on kernels: No
Integrated: No
Support host page-locked memory mapping: Yes
Compute mode: Default (multiple host threads can use this device simultaneously)
Concurrent kernel execution: No
Device has ECC support enabled: No
Device is using TCC driver mode: No
Device 1: "Tesla T10 Processor"
......(same as Device 0)
Device 2: "Tesla T10 Processor"
......(same as Device 0)
Device 3: "Tesla T10 Processor"
......(same as Device 0)
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 3.20, CUDA Runtime Version = 3.20, NumDevs = 4, Device = Tesla T10 Processor, Device = Tesla T10 Processor
PASSED
Press <Enter> to Quit...
cuda2 (suspicious)
Code: Select all
running on: n53
gives a report of the devices that were found
./deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
There are 3 devices supporting CUDA
Device 0: "Tesla M2070"
CUDA Driver Version: 3.20
CUDA Runtime Version: 3.20
CUDA Capability Major/Minor version number: 2.0
Total amount of global memory: 5636554752 bytes
Multiprocessors x Cores/MP = Cores: 14 (MP) x 32 (Cores/MP) = 448 (Cores)
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 32768
Warp size: 32
Maximum number of threads per block: 1024
Maximum sizes of each dimension of a block: 1024 x 1024 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Clock rate: 1.15 GHz
Concurrent copy and execution: Yes
Run time limit on kernels: No
Integrated: No
Support host page-locked memory mapping: Yes
Compute mode: Default (multiple host threads can use this device simultaneously)
Concurrent kernel execution: Yes
Device has ECC support enabled: Yes
Device is using TCC driver mode: No
Device 1: "Tesla M2070"
......(same as Device 0)
Device 2: "Tesla M2070"
......(same as Device 0)
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 3.20, CUDA Runtime Version = 3.20, NumDevs = 3, Device = Tesla M2070, Device = Tesla M2070
PASSED
http://www.google.com/url?sa=t&source=w ... ZOjJVti5hQ
The hardware model is the HP ProLiant SL390s G7, a 2U half-width server.
I have now updated the CUDA drivers from 3.2 to 4.0; the same problem still occurs.
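For reference, the "suspicious" flag in the test program comes from the scaled residual || b - A x || / ((||A|| ||x|| + ||b||) * n * eps), compared against a threshold of 60. Plugging in the norms printed by the failing run reproduces the reported value (a quick sketch in Python; 2^-24 is the single-precision machine epsilon, matching slamch('E')):

```python
# Scaled residual check mirroring the Fortran test:
#   Rnorm / ((Anorm*Xnorm + Bnorm) * n * eps) > 60  =>  "suspicious"
EPS = 2.0 ** -24  # single-precision epsilon, as returned by slamch('E')

def scaled_residual(rnorm, anorm, xnorm, bnorm, n):
    """Forward-error estimate used by the acceptance test above."""
    return rnorm / ((anorm * xnorm + bnorm) * n * EPS)

# Norms printed by the suspicious run on cuda2 (n = 10240):
r = scaled_residual(rnorm=2.650e3, anorm=5.228e3,
                    xnorm=9.999e-1, bnorm=9.999e-1, n=10240)
print(round(r, 1))  # ~8.304E+02, the value the test reports as suspicious
```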
Test Code:
test_magma.f
Code: Select all
!
! -- MAGMA (version 1.0) --
! Univ. of Tennessee, Knoxville
! Univ. of California, Berkeley
! Univ. of Colorado, Denver
! November 2010
!
! @generated c
!
program testing_cgetrf_gpu_f
use magma
use more_mpi
integer :: rc, mydev, numdev
external cublas_init, cublas_set_matrix, cublas_get_matrix
external cublas_shutdown, cublas_alloc
external clange, cgemm, cgesv, slamch
real clange, slamch
integer cublas_alloc
real :: rnumber(2), Anorm, Bnorm, Rnorm, Xnorm
real, allocatable :: work(:)
complex, allocatable :: h_A(:), h_B(:), h_X(:)
magma_devptr_t :: devptrA, devptrB
integer, allocatable :: ipiv(:)
complex :: zone, mzone
integer :: i, n, info, stat, lda
integer :: size_of_elt, nrhs
real(kind=8) :: flops, t
integer :: tstart(2), tend(2)
PARAMETER ( nrhs = 1, zone = 1., mzone = -1. )
integer c1,c2,cr,cm
call init
ierr = cudaGetDeviceCount(numdev)
mydev = 0 !mod(cpuid,numdev)
print *, "Process ", cpuid, " of ", numprocs, " took GPU: ", mydev
ierr = cudaSetDevice(mydev)
call cublas_init()
n = 10240 ! 2048
lda = n
ldda = ((n+31)/32)*32
size_of_elt = sizeof_complex
!------ Allocate CPU memory
allocate(h_A(lda*n))
allocate(h_B(lda*nrhs))
allocate(h_X(lda*nrhs))
allocate(work(n))
allocate(ipiv(n))
!------ Allocate GPU memory
stat = cublas_alloc(ldda*n, size_of_elt, devPtrA)
if (stat .ne. 0) then
write(*,*) "device memory allocation failed"
stop
endif
stat = cublas_alloc(ldda*nrhs, size_of_elt, devPtrB)
if (stat .ne. 0) then
write(*,*) "device memory allocation failed"
stop
endif
!---- Initialize the matrix
do i=1,lda*n
call random_number(rnumber)
h_A(i) = rnumber(1)
end do
do i=1,lda*nrhs
call random_number(rnumber)
h_B(i) = rnumber(1)
end do
h_X(:) = h_B(:)
!---- devPtrA = h_A
call cublas_set_matrix(n, n, size_of_elt, h_A, lda, devptrA, ldda)
!---- devPtrB = h_B
call cublas_set_matrix(n, nrhs, size_of_elt, h_B, lda, devptrB, ldda)
call MPI_BARRIER(mpi_comm_world,mpi_err)
call system_clock( c1, cr, cm )
!---- Call magma LU ----------------
call magma_gettime_f(tstart)
call magmaf_cgetrf_gpu(n, n, devptrA, ldda, ipiv, info)
call magma_gettime_f(tend)
call MPI_BARRIER(mpi_comm_world,mpi_err)
call system_clock( count=c2 )
print *, ' runtime:', 1.e3*real(c2-c1) / real(cr), 'ms'
if ( info .ne. 0 ) then
write(*,*) "Info : ", info
end if
!---- Call magma solve -------------
call magmaf_cgetrs_gpu('n', n, nrhs, devptrA, ldda, ipiv, devptrB, ldda, info)
if ( info .ne. 0 ) then
write(*,*) "Info : ", info
end if
!---- h_X = devptrB
call cublas_get_matrix (n, nrhs, size_of_elt, devptrB, ldda, h_X, lda)
!---- Compare the two results ------
Anorm = clange('I', n, n, h_A, lda, work)
Bnorm = clange('I', n, nrhs, h_B, lda, work)
Xnorm = clange('I', n, nrhs, h_X, lda, work)
call cgemm('n', 'n', n, nrhs, n, zone, h_A, lda, h_X, lda, mzone, h_B, lda)
Rnorm = clange('I', n, nrhs, h_B, lda, work)
write(*,*)
write(*,* ) 'Solving A x = b using LU factorization:'
write(*,105) ' || A || = ', Anorm
write(*,105) ' || b || = ', Bnorm
write(*,105) ' || x || = ', Xnorm
write(*,105) ' || b - A x || = ', Rnorm
flops = 2. * n * n * n / 3.
call magma_gettimervalue_f(tstart, tend, t)
write(*,*) ' Gflops = ', flops / t / 1e6
write(*,*)
Rnorm = Rnorm / ( (Anorm*Xnorm+Bnorm) * n * slamch('E') )
if ( Rnorm > 60. ) then
write(*,105) ' Solution is suspicious, ', Rnorm
else
write(*,105) ' Solution is CORRECT'
end if
!---- Free CPU memory
deallocate(h_A, h_X, h_B, work, ipiv)
!---- Free GPU memory
call cublas_free(devPtrA)
call cublas_free(devPtrB)
call cublas_shutdown()
105 format((a35,es10.3))
call MPI_FINALIZE(rc)
end
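One detail worth noting in the listing: ldda = ((n+31)/32)*32 pads the leading dimension of the device matrix up to the next multiple of 32, which keeps GPU accesses aligned. A trivial sketch of what that integer arithmetic does:

```python
def pad32(n):
    """Round n up to the next multiple of 32 (the ldda computation above)."""
    return ((n + 31) // 32) * 32

print(pad32(10240))  # 10240: already a multiple of 32, so ldda == n here
print(pad32(1000))   # 1024: padded up to the next multiple of 32
```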
Code: Select all
module more_mpi
include 'mpif.h'
integer :: ierr,cpuid,numprocs,namelen !mpi
character(len=100) processor_name
end module
Code: Select all
subroutine init
use more_mpi
call mpi_init(ierr)
call mpi_comm_rank(mpi_comm_world,cpuid,ierr)
call mpi_comm_size(mpi_comm_world,numprocs,ierr)
call mpi_get_processor_name(processor_name,namelen,ierr)
end subroutine init
Code: Select all
.SUFFIXES: .cuf .o
L1= test_magma.o more_mpi.o init.o fortran.o
CULAINCLUDES= -I${CULA_INC_PATH}
CULALIBPATH64= -L${CULA_LIB_PATH_64}
CUDAINCLUDES= -I${CUDA_INC_PATH}
CUDALIBPATH64= -L${CUDA_LIB_PATH_64}
CUDALIBS= -lcudart -lcuda -lcublas
MAGMALIBPATH=-L/u/af/bj/cding/MAGMA/lib
MAGMALIBS= -lmagma -lmagmablas
GPULIBS= -lcula_pgfortran #-lcula -lcublas -lcudart
PGFLAGS= -Mfree -O3 -DADD_
#CUDA= -ta=nvidia -Mcuda
CUDA=
SOPT=
LINK1= /opt/intel/Compiler/11.1/069/mkl/lib/em64t/libmkl_scalapack_lp64.a \
/opt/intel/Compiler/11.1/069/mkl/lib/em64t/libmkl_intel_lp64.a \
/opt/intel/Compiler/11.1/069/mkl/lib/em64t/libmkl_blacs_openmpi_lp64.a \
/opt/intel/Compiler/11.1/069/mkl/lib/em64t/libmkl_core.a \
/opt/intel/Compiler/11.1/069/mkl/lib/em64t/libmkl_sequential.a \
/opt/intel/Compiler/11.1/069/mkl/lib/em64t/libmkl_core.a \
/opt/intel/Compiler/11.1/069/mkl/lib/em64t/libmkl_sequential.a \
/opt/intel/Compiler/11.1/069/mkl/lib/em64t/libmkl_core.a \
-lpthread
#LINK_CU= /opt/pgi/linux86-64/10.6/lib/libcudafor.a
LINK_CU= /opt/pgi/linux86-64/11.5/lib/libcudafor.a
PF90= mpif90
PGFOR= pgfortran
PGA_EX= magma_test_mgpus
main: $(L1)
$(PF90) $(SOPT) $(PGFLAGS) $(L1) $(CUDAINCLUDES) $(CUDALIBPATH64) $(CUDALIBS) $(CULAINCLUDES) $(CULALIBPATH64) $(GPULIBS) $(CUDAINCLUDES) $(CUDALIBPATH64) $(CUDALIBS) $(MAGMALIBPATH) $(MAGMALIBS) $(LINK1) $(LINK_CU) -o $(PGA_EX)
.f.o:
$(PF90) $(SOPT) $(PGFLAGS) -Mpreprocess -Dmagma_devptr_t="integer(kind=8)" -I/opt/development/gpu/current/cuda/include -I/u/af/bj/cding/download/magma_1.0.0-rc5/include -I/u/af/bj/cding/download/magma_1.0.0-rc5/include $(CUDAINCLUDES) $(CUDALIBPATH64) $(CUDALIBS) -c $<
.cuf.o:
$(PGFOR) $(SOPT) $(PGFLAGS) $(CUDA) $(CULAINCLUDES) $(CULALIBPATH64) $(GPULIBS) -I/opt/development/gpu/current/cuda/include -I/u/af/bj/cding/download/magma_1.0.0-rc5/include -I/u/af/bj/cding/download/magma_1.0.0-rc5/include $(CUDAINCLUDES) $(CUDALIBPATH64) $(CUDALIBS) -c $<
.c.o:
gcc -O3 -DCUBLAS_USE_THUNKING $(CUDAINCLUDES) -c $<
test_magma.o: test_magma.f init.o more_mpi.o
more_mpi.o: more_mpi.f
init.o: init.f more_mpi.o
fortran.o: fortran.c
clean:
/bin/rm -f *o *mod $(L1b) $(L2b) $(PGA_EX)
del:
rm -f *.mio.mines.edu
Code: Select all
mpiexec -np 2 magma_test_mgpus
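Note that in the posted source, mydev is hard-coded to 0 and the mod(cpuid,numdev) expression is commented out; the "took GPU: 0/1" lines in the logs above suggest the round-robin mapping was active when they were captured. For illustration, that commented-out expression assigns ranks to devices like this:

```python
def pick_device(rank, numdev):
    """Round-robin MPI-rank-to-GPU mapping, i.e. mod(cpuid, numdev)."""
    return rank % numdev

# Two ranks on the 4-GPU node (cuda1) land on distinct devices:
print([pick_device(r, 4) for r in (0, 1)])  # [0, 1]
```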