Lapack test failed in Magma 2.2
Posted: Mon Dec 26, 2016 3:14 am
by organicchemistry_01
I followed the example make.inc.mkl-gcc on a system with:
Pascal GPU (GTX 1060)
MKL 2017
CUDA 8.0
GCC 4.9
Dual Intel Xeon (32 logical cores)
I ran the tests provided in the MAGMA source directory and got about 10x the performance with cuBLAS compared to the CPU BLAS. All good, but when it reached the LAPACK testing I got lots of failed results that seem to occur regardless of matrix size: some big matrices passed, but most failed. About 1/4 of the LAPACK tests fail.
So I thought maybe it was just an MKL problem and switched to OpenBLAS + GCC 4.9 + CUDA 8, using the OpenBLAS gcc make.inc of course. However, I got the same results: the LAPACK failures occur exactly where the MAGMA + MKL build fails.
I couldn't see any problem in the supplied make.inc examples, as all cuBLAS-related tests passed with flying colors. However, 1/4 of the LAPACK tests fail with either MKL or OpenBLAS. How can this be resolved?
Re: Lapack test failed in Magma 2.2
Posted: Wed Dec 28, 2016 3:07 pm
by mgates3
Which routines passed & which failed? Can you post failures of some routines? Please include the complete input & output so we know what command line you used. Please also include your make.inc file and any environment variables you set (e.g., CUDADIR, GPU_TARGET).
I assume this is on Linux?
-mark
Re: Lapack test failed in Magma 2.2
Posted: Thu Dec 29, 2016 5:03 pm
by organicchemistry_01
Hi,
Yes, this is on Linux.
The command line I used for testing is
in the testing directory
The make file is make.inc.mkl-gcc. I did not change anything in it except the paths and GPU_TARGET. Here it is:
Code: Select all
GPU_TARGET ?= Pascal
CC = gcc
CXX = g++
NVCC = /usr/local/cuda/bin/nvcc
FORT = gfortran
ARCH = ar
ARCHFLAGS = cr
RANLIB = ranlib
FPIC = -fPIC
CFLAGS = -O3 $(FPIC) -fopenmp -DNDEBUG -DADD_ -Wall -Wshadow -DMAGMA_WITH_MKL
FFLAGS = -O3 $(FPIC) -DNDEBUG -DADD_ -Wall -Wno-unused-dummy-argument
F90FLAGS = -O3 $(FPIC) -DNDEBUG -DADD_ -Wall -Wno-unused-dummy-argument -x f95-cpp-input
NVCCFLAGS = -O3 -DNDEBUG -DADD_ -Xcompiler "$(FPIC) -Wall -Wno-unused-function"
LDFLAGS = $(FPIC) -fopenmp
CXXFLAGS := $(CFLAGS) -std=c++11
CFLAGS += -std=c99
LIB = -lmkl_gf_lp64 -lmkl_gnu_thread -lmkl_core -lpthread -lstdc++ -lm -lgfortran
LIB += -lcublas -lcusparse -lcudart -lcudadevrt
MKLROOT ?= /opt/intel/mkl
CUDADIR ?= /usr/local/cuda
-include make.check-mkl
-include make.check-cuda
LIBDIR = -L$(CUDADIR)/lib64 \
-L$(MKLROOT)/lib/intel64
INC = -I$(CUDADIR)/include \
-I$(MKLROOT)/include
I did the same for make.inc.openblas but updated the directories and nvcc path.
I would like to attach the ./run_tests.py output, but our workstations are down for December/January maintenance. There was nothing unusual in the table output by ./run_tests.py: all the cuBLAS error checks passed, but when it reached the LAPACK error check (last column), too many tests failed.
As I can remember its those tests involving
lapack error check as the last column from
However, not all steps have failures; it is mostly those involving bigger matrices. Failed tests kept coming once it reached those routines, which I found unusual compared to the cuBLAS error check, where everything passed. I terminated the run before it could reach other routines.
I hope this helps us find a possible solution. Could it be that something is just missing in the make.inc file?
Re: Lapack test failed in Magma 2.2
Posted: Fri Jan 06, 2017 11:21 pm
by organicchemistry_01
I am now attaching a more detailed output of ./run_tests.py.
There were about 5k failed tests against 150k passed, which I would say is a remarkable number of passed tests.
Here is a short summary of failed tests:
1. testing_zgemv on 600x1 matrix
2. testing_*trmv on CUBLAS error
3. testing_*trsm on LAPACK error
However, I could not finish all the tests because they take too long. If you would like, I can continue the tests, but I don't know how to restart the run where it left off.
Re: Lapack test failed in Magma 2.2
Posted: Sat Jan 07, 2017 3:55 pm
by mgates3
Thanks.
The output is a bit garbled in places from mixing stdout and stderr. For future reference, you can redirect output into a file which should avoid that issue. You can also select smaller tests to run it faster. The default is --small --medium --large (-s -m -l).
Mostly, the "failures" are caused by having the tolerance a bit too low. The default is 30. Using 100 will eliminate a lot of these issues. A few routines -- notably trsm -- don't have very tight error bounds yet, so may require a higher tolerance than that even.
Code: Select all
run_tests.py -s -m --tol 100 > results.txt
Fortunately, you can see what the results would be with a different tolerance without re-running them. Use
Code: Select all
run_summarize.py --tol 100 lapackerrors.txt > results100.txt
run_summarize.py --tol 200 lapackerrors.txt > results200.txt
This does several things:
- Finds errors like "3.34e-06" and adds error/eps after it in { } braces, like "3.34e-06 { 56.0}". That (error/eps) number is what is tested against tolerance. So in this case, 56.0 > 30, the default tolerance, so it would fail, but it's less than 100.
- Changes "failed" to "suspect" if all the (error/eps) are less than the new tolerance.
- Sorts failures into categories: okay, errors (segfaults), failed, suspicious, known failures. Most of the failures that you observed are in the known failures, and come from 4 routines: trsm, gesv_rbt, geqr2x version 2 and 4, and gegqr. We need to fix the error check for trsm. See BUGS.txt about others.
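As a rough illustration of the first two steps (this is not the actual run_summarize.py code), here is a minimal Python sketch. The names EPS, ERROR_RE and annotate are made up for the example. The single-precision eps is inferred from the example above (3.34e-06 / 56.0 is about 2^-24); the double-precision value is an assumption. It also naively treats every number printed in scientific notation as an error value.

```python
import re

# Unit roundoff per precision. "single" is consistent with the example above;
# "double" is an assumption made for this sketch.
EPS = {"single": 2.0 ** -24, "double": 2.0 ** -53}

# Numbers printed in scientific notation, e.g. 3.34e-06 (assumed error format).
ERROR_RE = re.compile(r"\b\d\.\d+e[+-]\d+")


def annotate(line, precision="single", tol=30.0):
    """Append {error/eps} after each error; relabel 'failed' if all ratios < tol."""
    eps = EPS[precision]
    ratios = []

    def add_ratio(match):
        ratio = float(match.group(0)) / eps
        ratios.append(ratio)
        return "%s {%5.1f}" % (match.group(0), ratio)

    annotated = ERROR_RE.sub(add_ratio, line)
    # Only relabel when every error on the line is within the new tolerance.
    if ratios and all(r < tol for r in ratios):
        annotated = annotated.replace("failed", "suspect")
    return annotated
```

With the default tolerance of 30, a line containing 3.34e-06 keeps its "failed" label; with tol=100 the same line is relabelled "suspect", because 56.0 is below 100.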
There are a few errors to look into here. zheevd version 3, which is actually zheevr (MRRR) seems to have some issues. zgemv had one weird error.
-mark
Re: Lapack test failed in Magma 2.2
Posted: Sat Jan 07, 2017 4:06 pm
by mgates3
Also, if you are interested, to restart it near where it left off, use the --start option.
Code: Select all
run_tests.py --start testing_zhetrd
I usually run smaller groups of routines together, e.g.,
Code: Select all
run_tests.py --blas > blas.txt
run_tests.py --aux > aux.txt
run_tests.py --chol > chol.txt
and so on.
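If you want to script this, something like the following Python sketch could work. It is only an illustration: build_command, run_to_files and run_groups are hypothetical helper names, not part of MAGMA, and the ".err.txt" naming is my own choice. It writes stdout and stderr to separate files so the two streams cannot get mixed.

```python
import subprocess


def build_command(group, script="./run_tests.py", extra_args=()):
    """Build e.g. ['./run_tests.py', '--blas'] for the group 'blas'."""
    return [script, "--" + group, *extra_args]


def run_to_files(cmd, out_path, err_path):
    """Run cmd with stdout and stderr written to two separate files."""
    with open(out_path, "w") as out, open(err_path, "w") as err:
        result = subprocess.run(cmd, stdout=out, stderr=err)
    return result.returncode


def run_groups(groups, script="./run_tests.py"):
    """Run one group per invocation; results go to <group>.txt and <group>.err.txt."""
    codes = {}
    for group in groups:
        cmd = build_command(group, script)
        codes[group] = run_to_files(cmd, group + ".txt", group + ".err.txt")
    return codes
```

Called from the testing directory as run_groups(["blas", "aux", "chol"]), this would produce blas.txt, aux.txt and chol.txt as in the commands above, plus one .err.txt file per group.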
-mark