MAGMA Forum

Posted: **Mon Dec 26, 2016 3:14 am**

I followed the make.inc.mkl-gcc example on a system with:

Pascal GPU (GTX 1060)
MKL 2017
CUDA 8.0
GCC 4.9
Dual Intel Xeon (32 logical cores)

I ran the tests provided in the MAGMA source directory and got about 10x the performance with cuBLAS compared to the CPU BLAS. All good, but when it reached the LAPACK checks I got lots of failed results that seem to occur regardless of matrix size: some big matrices passed, but most failed. About 1/4 of the LAPACK checks fail.

So I thought maybe it was just an MKL problem and switched to OpenBLAS + GCC 4.9 + CUDA 8, using the OpenBLAS gcc make.inc of course. However, I got the same results: the LAPACK failures occur exactly where the MAGMA+MKL build fails.

I couldn't see any problem in the supplied make.inc examples, as all cuBLAS-related tests passed with flying colors, yet about 1/4 of the LAPACK checks fail with either MKL or OpenBLAS. How can this be resolved?

Posted: **Wed Dec 28, 2016 3:07 pm**

Which routines passed & which failed? Can you post failures of some routines? Please include the complete input & output so we know what command line you used. Please also include your make.inc file and any environment variables you set (e.g., CUDADIR, GPU_TARGET).

I assume this is on Linux?

-mark

Posted: **Thu Dec 29, 2016 5:03 pm**

Hi,

Yes, this is on Linux.

The command line I used for testing is

Code: Select all

./run_tests.py

in the testing directory

The make file is make.inc.mkl-gcc. I did not change anything there except the paths and GPU_TARGET. Here it is:

Code: Select all

GPU_TARGET ?= Pascal

CC        = gcc
CXX       = g++
NVCC      = /usr/local/cuda/bin/nvcc
FORT      = gfortran

ARCH      = ar
ARCHFLAGS = cr
RANLIB    = ranlib

FPIC      = -fPIC

CFLAGS    = -O3 $(FPIC) -fopenmp -DNDEBUG -DADD_ -Wall -Wshadow -DMAGMA_WITH_MKL
FFLAGS    = -O3 $(FPIC)          -DNDEBUG -DADD_ -Wall -Wno-unused-dummy-argument
F90FLAGS  = -O3 $(FPIC)          -DNDEBUG -DADD_ -Wall -Wno-unused-dummy-argument -x f95-cpp-input
NVCCFLAGS = -O3                  -DNDEBUG -DADD_ -Xcompiler "$(FPIC) -Wall -Wno-unused-function"
LDFLAGS   =     $(FPIC) -fopenmp

CXXFLAGS := $(CFLAGS) -std=c++11
CFLAGS   += -std=c99

LIB       = -lmkl_gf_lp64 -lmkl_gnu_thread -lmkl_core -lpthread -lstdc++ -lm -lgfortran
LIB      += -lcublas -lcusparse -lcudart -lcudadevrt

MKLROOT ?= /opt/intel/mkl
CUDADIR ?= /usr/local/cuda
-include make.check-mkl
-include make.check-cuda

LIBDIR    = -L$(CUDADIR)/lib64 \
            -L$(MKLROOT)/lib/intel64

INC       = -I$(CUDADIR)/include \
            -I$(MKLROOT)/include

I did the same for make.inc.openblas but updated the directories and nvcc path.

I wish I could attach the ./run_tests.py output, but our workstations are down for December/January maintenance. There is nothing unusual in the table output by ./run_tests.py; all the cuBLAS-related error checks passed, but when it reached the LAPACK error check (last column), too many tests failed.

As far as I can remember, the failures are in the tests that have the LAPACK error check as the last column, from

Code: Select all

testing_c**** routine

However, not all steps fail; it is mostly those involving bigger matrices. Failed tests kept coming once it reached those routines, which I find unusual compared to the cuBLAS error check, where everything passed. I terminated the run before it could reach other routines.

I hope this helps in finding a possible solution. Perhaps there is just something missing in the make.inc file?

Posted: **Fri Jan 06, 2017 11:21 pm**

I am now attaching a more detailed output of ./run_tests.py

There were 5k failed tests against 150k passed, which I would say is a remarkably high number of passes.

Here is a short summary of failed tests:

1. testing_zgemv on 600x1 matrix
2. testing_*trmv on CUBLAS error
3. testing_*trsm on LAPACK error

However, I could not finish all the tests, as they take too long! If you would like, I can continue the tests, but I don't know how to restart them where they left off.

Posted: **Sat Jan 07, 2017 3:55 pm**

Thanks.
The output is a bit garbled in places from mixing stdout and stderr. For future reference, you can redirect output into a file which should avoid that issue. You can also select smaller tests to run it faster. The default is --small --medium --large (-s -m -l).

Code: Select all

run_tests.py -s -m > results.txt

Mostly, the "failures" are caused by having the tolerance a bit too low. The default is 30. Using 100 will eliminate a lot of these issues. A few routines -- notably trsm -- don't have very tight error bounds yet, so may require a higher tolerance than that even.

Code: Select all

run_tests.py -s -m --tol 100 > results.txt
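To get a feel for what a tolerance of 30 versus 100 means in absolute terms, here is a minimal sketch. It assumes a result is accepted when the error divided by machine epsilon is below the tolerance, which is how the check is described later in this thread. The epsilon values and the helper names `passes` and `max_error` are assumptions for illustration, not part of the MAGMA tester.

```python
# Assumption: unit roundoff for single and double precision.
EPS = {"single": 2.0 ** -24, "double": 2.0 ** -53}

def passes(error, tol, precision):
    """Return True if error/eps is below the tolerance."""
    return error / EPS[precision] < tol

def max_error(tol, precision):
    """Largest absolute error accepted at a given tolerance."""
    return tol * EPS[precision]

for tol in (30, 100):
    for precision in ("single", "double"):
        print("tol %3d, %s: error must be below %.2e"
              % (tol, precision, max_error(tol, precision)))
```

Under these assumptions, a single-precision error of 3.34e-06 is rejected at tolerance 30 but accepted at 100.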

Fortunately, you can see what the results would be with a different tolerance without re-running them. Use

Code: Select all

run_summarize.py --tol 100 lapackerrors.txt > results100.txt
run_summarize.py --tol 200 lapackerrors.txt > results200.txt

This does several things:

- Finds errors like "3.34e-06" and adds error/eps after it in { } braces, like "3.34e-06 { 56.0}". That (error/eps) number is what is tested against tolerance. So in this case, 56.0 > 30, the default tolerance, so it would fail, but it's less than 100.
- Changes "failed" to "suspect" if all the (error/eps) are less than the new tolerance.
- Sorts failures into categories: okay, errors (segfaults), failed, suspicious, known failures. Most of the failures that you observed are in the known failures, and come from 4 routines: trsm, gesv_rbt, geqr2x version 2 and 4, and gegqr. We need to fix the error check for trsm. See BUGS.txt about others.
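The first two steps can be sketched in a few lines of Python. This is only an illustration of the idea, not the actual run_summarize.py code: the regular expression, the function names `annotate` and `relabel`, and the sample line are made up, and the epsilon is assumed to be the single-precision value because it reproduces the 3.34e-06 to 56.0 example above.

```python
import re

# Assumption: single-precision unit roundoff (2**-24, about 5.96e-08).
EPS = 2.0 ** -24

# Matches values in scientific notation such as "3.34e-06".
ERROR_RE = re.compile(r"\d\.\d+e[+-]\d+")

def annotate(line, eps=EPS):
    """Append error/eps in braces after every error value on the line."""
    def add_ratio(match):
        ratio = float(match.group(0)) / eps
        return "%s {%5.1f}" % (match.group(0), ratio)
    return ERROR_RE.sub(add_ratio, line)

def relabel(line, tol, eps=EPS):
    """Turn "failed" into "suspect" when every error/eps is below tol."""
    ratios = [float(value) / eps for value in ERROR_RE.findall(line)]
    if ratios and all(ratio < tol for ratio in ratios):
        line = line.replace("failed", "suspect")
    return annotate(line, eps)

sample = "1024  3.34e-06   failed"  # made-up line, not real tester output
print(relabel(sample, tol=30))
print(relabel(sample, tol=100))
```

With the default tolerance of 30 the sample line stays "failed"; with 100 it becomes "suspect".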

There are a few errors to look into here. zheevd version 3, which is actually zheevr (MRRR), seems to have some issues. zgemv had one weird error.

-mark

Posted: **Sat Jan 07, 2017 4:06 pm**

Also, if you are interested, to restart it near where it left off, use the --start option.

Code: Select all

run_tests.py --start testing_zhetrd

I usually run smaller groups of routines together, e.g.,

Code: Select all

run_tests.py --blas > blas.txt
run_tests.py --aux > aux.txt
run_tests.py --chol > chol.txt

and so on.
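If you prefer to script this, a small Python wrapper can loop over the groups. This is a hedged sketch: the helper names `build_commands` and `run_groups` are hypothetical, only the three group options shown above are used, and it assumes it is started from the testing directory.

```python
import subprocess

def build_commands(groups, script="./run_tests.py"):
    """Return (argv, output_file) pairs, one per routine group."""
    return [([script, "--" + group], group + ".txt") for group in groups]

def run_groups(groups):
    """Run each group in turn, writing its stdout to <group>.txt."""
    for argv, outfile in build_commands(groups):
        with open(outfile, "w") as out:
            # check=False: keep going even if one group exits with an error.
            subprocess.run(argv, stdout=out, check=False)

# Show what would be run without actually starting the tests.
for argv, outfile in build_commands(["blas", "aux", "chol"]):
    print(" ".join(argv), ">", outfile)
```

Calling `run_groups(["blas", "aux", "chol"])` would then run the three groups one after another.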

-mark

Lapack test failed in Magma 2.2
