Lapack test failed in Magma 2.2
-
organicchemistry_01
- Posts: 4
- Joined: Mon Dec 26, 2016 2:21 am
Lapack test failed in Magma 2.2
I followed the make.inc.mkl-gcc example on a system with
Pascal GPU (GTX 1060)
MKL 2017
CUDA 8.0
GCC 4.9
Dual Intel Xeon (32 logical cores)
I ran the tests provided in the MAGMA source directory and got about 10x the performance with cuBLAS compared to CPU BLAS. All good, but when it reached the LAPACK testing I got lots of failed results that seem to occur regardless of matrix size; some large matrices passed, but most failed. About 1/4 of the LAPACK tests fail.
So I thought maybe it was just an MKL problem and switched to OpenBLAS + GCC 4.9 + CUDA 8, using the OpenBLAS gcc make.inc of course. However, I got the same results: the LAPACK failures occur exactly where the MKL build fails.
I couldn't see any problem in the supplied make.inc examples, as all cuBLAS-related tests passed with flying colors, yet about 1/4 of the LAPACK tests fail with either MKL or OpenBLAS. How could this be resolved?
Re: Lapack test failed in Magma 2.2
Which routines passed & which failed? Can you post failures of some routines? Please include the complete input & output so we know what command line you used. Please also include your make.inc file and any environment variables you set (e.g., CUDADIR, GPU_TARGET).
I assume this is on Linux?
-mark
-
organicchemistry_01
- Posts: 4
- Joined: Mon Dec 26, 2016 2:21 am
Re: Lapack test failed in Magma 2.2
Hi,
Yes, this is on Linux.
The command line I used for testing is
Code: Select all
./run_tests.py
in the testing directory.
The makefile is make.inc.mkl-gcc. I did not change anything there except for the paths and GPU_TARGET. Here it is:
Code: Select all
GPU_TARGET ?= Pascal
CC = gcc
CXX = g++
NVCC = /usr/local/cuda/bin/nvcc
FORT = gfortran
ARCH = ar
ARCHFLAGS = cr
RANLIB = ranlib
FPIC = -fPIC
CFLAGS = -O3 $(FPIC) -fopenmp -DNDEBUG -DADD_ -Wall -Wshadow -DMAGMA_WITH_MKL
FFLAGS = -O3 $(FPIC) -DNDEBUG -DADD_ -Wall -Wno-unused-dummy-argument
F90FLAGS = -O3 $(FPIC) -DNDEBUG -DADD_ -Wall -Wno-unused-dummy-argument -x f95-cpp-input
NVCCFLAGS = -O3 -DNDEBUG -DADD_ -Xcompiler "$(FPIC) -Wall -Wno-unused-function"
LDFLAGS = $(FPIC) -fopenmp
CXXFLAGS := $(CFLAGS) -std=c++11
CFLAGS += -std=c99
LIB = -lmkl_gf_lp64 -lmkl_gnu_thread -lmkl_core -lpthread -lstdc++ -lm -lgfortran
LIB += -lcublas -lcusparse -lcudart -lcudadevrt
MKLROOT ?= /opt/intel/mkl
CUDADIR ?= /usr/local/cuda
-include make.check-mkl
-include make.check-cuda
LIBDIR = -L$(CUDADIR)/lib64 \
         -L$(MKLROOT)/lib/intel64
INC = -I$(CUDADIR)/include \
      -I$(MKLROOT)/include
I did the same for make.inc.openblas but updated the directories and nvcc path.
I wish I could attach the ./run_tests.py output, but our workstations are down now for December/January maintenance. There is nothing unusual in the table output by ./run_tests.py: all the cuBLAS-related error checks passed, but when it reached the LAPACK error check (last column) too many tests failed.
As far as I can remember, it is the tests involving the LAPACK error check in the last column, from
Code: Select all
testing_c**** routine
However, not all steps have failures; it is mostly those involving bigger matrices. The failed tests never stopped coming once it reached those routines, which I find unusual compared to the cuBLAS error check, where everything passed. I terminated the run before it could reach other routines.
I hope this helps in finding a possible solution. I think there may just be something missing in the make.inc file?
-
organicchemistry_01
- Posts: 4
- Joined: Mon Dec 26, 2016 2:21 am
Re: Lapack test failed in Magma 2.2
I am now attaching a more detailed output of ./run_tests.py
There were about 5k failed tests against 150k passed, which I would say is a remarkable pass rate.
Here is a short summary of failed tests:
1. testing_zgemv on 600x1 matrix
2. testing_*trmv on CUBLAS error
3. testing_*trsm on LAPACK error
However, I could not finish all the tests because they take too long! If you would like, I can continue the tests, but I don't know how to restart them where they left off.
- Attachments
-
- lapackerrors.tar.gz
- Magma 2.2 failed tests
- (1.96 MiB) Downloaded 148 times
Re: Lapack test failed in Magma 2.2
Thanks.
The output is a bit garbled in places from mixing stdout and stderr. For future reference, you can redirect output into a file which should avoid that issue. You can also select smaller tests to run it faster. The default is --small --medium --large (-s -m -l).
Code: Select all
run_tests.py -s -m > results.txt
Mostly, the "failures" are caused by having the tolerance a bit too low. The default is 30. Using 100 will eliminate a lot of these issues. A few routines -- notably trsm -- don't have very tight error bounds yet, so may require a higher tolerance than that even.
Code: Select all
run_tests.py -s -m --tol 100 > results.txt
Fortunately, you can see what the results would be with a different tolerance without re-running them. Use
Code: Select all
run_summarize.py --tol 100 lapackerrors.txt > results100.txt
run_summarize.py --tol 200 lapackerrors.txt > results200.txt
This does several things:
- Finds errors like "3.34e-06" and adds error/eps after it in { } braces, like "3.34e-06 { 56.0}". That (error/eps) number is what is tested against tolerance. So in this case, 56.0 > 30, the default tolerance, so it would fail, but it's less than 100.
- Changes "failed" to "suspect" if all the (error/eps) are less than the new tolerance.
- Sorts failures into categories: okay, errors (segfaults), failed, suspicious, known failures. Most of the failures that you observed are in the known failures, and come from 4 routines: trsm, gesv_rbt, geqr2x version 2 and 4, and gegqr. We need to fix the error check for trsm. See BUGS.txt about others.
-mark
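To make the tolerance check described above concrete, here is a rough Python sketch of the error/eps comparison. It is not the actual run_summarize.py code: the function names, the regular expression, and the choice of eps = 2**-24 (single-precision unit roundoff, which happens to reproduce the 56.0 figure quoted above) are all assumptions for illustration.

```python
import re

# Assumption: single-precision unit roundoff. With this value,
# 3.34e-06 / eps comes out as roughly 56.0, matching the example in the post.
EPS_SINGLE = 2.0 ** -24

# Hypothetical pattern for errors printed in scientific notation, e.g. "3.34e-06".
ERROR_RE = re.compile(r"\b\d+\.\d+e[+-]\d+\b")


def error_over_eps(error, eps=EPS_SINGLE):
    """Return the scaled error that is compared against the tolerance."""
    return error / eps


def within_tolerance(error, tol, eps=EPS_SINGLE):
    """True if error/eps is below the given tolerance."""
    return error_over_eps(error, eps) < tol


def annotate(line, eps=EPS_SINGLE):
    """Append '{ ratio}' after every scientific-notation error found in a line."""
    def add_ratio(match):
        ratio = error_over_eps(float(match.group(0)), eps)
        return f"{match.group(0)} {{{ratio:5.1f}}}"
    return ERROR_RE.sub(add_ratio, line)


def relabel(line, tol, eps=EPS_SINGLE):
    """Change 'failed' to 'suspect' when every error on the line is below tol."""
    ratios = [error_over_eps(float(m), eps) for m in ERROR_RE.findall(line)]
    if "failed" in line and ratios and all(r < tol for r in ratios):
        return line.replace("failed", "suspect")
    return line
```

With these definitions, an error of 3.34e-06 is rejected at the default tolerance of 30 but accepted at 100, which is the situation described for most of the reported "failures". For double-precision routines a different eps would have to be passed in.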
Re: Lapack test failed in Magma 2.2
Also, if you are interested, to restart it near where it left off, use the --start option.
Code: Select all
run_tests.py --start testing_zhetrd
I usually run smaller groups of routines together, e.g.,
Code: Select all
run_tests.py --blas > blas.txt
run_tests.py --aux > aux.txt
run_tests.py --chol > chol.txt
and so on.
-mark
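If you want to script the per-group runs above, something along these lines might work. It is only a sketch: the helper names are made up, the group list contains just the three flags shown in the post, and combining -s -m and --tol with a group flag in one command is an assumption. Stdout and stderr go to separate files so they do not get mixed.

```python
import subprocess
from pathlib import Path

# Group flags taken from the post above; other groups may exist.
GROUPS = ["blas", "aux", "chol"]


def build_command(group, sizes=("-s", "-m"), tol=None, script="./run_tests.py"):
    """Build the argument list for one group of tests (hypothetical helper)."""
    cmd = [script, *sizes, f"--{group}"]
    if tol is not None:
        # Assumption: --tol can be combined with a group flag.
        cmd += ["--tol", str(tol)]
    return cmd


def run_group(group, out_dir=".", **kwargs):
    """Run one group, writing stdout to <group>.txt and stderr to <group>.err."""
    out_path = Path(out_dir) / f"{group}.txt"
    err_path = Path(out_dir) / f"{group}.err"
    with open(out_path, "w") as out, open(err_path, "w") as err:
        subprocess.run(build_command(group, **kwargs),
                       stdout=out, stderr=err, check=False)
    return out_path
```

Calling run_group(g, tol=100) for each g in GROUPS from the testing directory would then produce blas.txt, aux.txt and chol.txt, the same files as the commands above.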