1) What is the timer resolution used in the testing programs (from the MAGMA testing directory) for Linux/x86-64?
I obtained some (possibly nonsense) GFLOPS values from a testing_dgemm run on a C2050 where
small 32x32 matrices were used. But such a dgemm call requires only about 1e-5 sec on
one Nehalem/2.7 GHz core.
2) Do I understand correctly that the start/stop timings are taken on the x86 host side, i.e.
that (for example) the runtimes printed by testing_dgemm (and hence the performance figures themselves) include PCIe transfer delays?
3) Do I need to perform MAGMA performance tuning (based on src/get_nb.cpp)
after installing 1.0.0-rc4 for an NVIDIA C2050?
time measurements in MAGMA
Re: time measurements in MAGMA
I found this paper by Orion Lawlor very helpful for getting an understanding of the timing of data transfers to a GPU:
http://lawlor.cs.uaf.edu/~olawlor/paper ... I_2009.pdf
There is a setup time for a transfer, which will dominate for small transfers.
I hope this helps
John
Re: time measurements in MAGMA
Thanks for your reference, it's very helpful!
But my question was a bit more basic: what does the get_current_time call measure, and what therefore is the time difference
between the end and start times (does it include PCIe transfer time, as I believe)? And what about the resolution of get_current_time?
Stan Tomov
Posts: 283
Joined: Fri Aug 21, 2009 10:39 pm
Re: time measurements in MAGMA
The function get_current_time calls gettimeofday, so the resolution is a microsecond. Before gettimeofday is called there is a call to cudaThreadSynchronize() to make sure previous GPU tasks have completed. Thus one can measure the time of a particular GPU kernel by surrounding it with calls to get_current_time. If there are functions transferring data between two get_current_time calls, the measured time will include the time for the transfer.
Re: time measurements in MAGMA
OK.
I looked again at the MAGMA testing source. It looks like magma_dgemm itself includes host-device data exchanges (right?).
There are also cublasSetMatrix calls before the magma_dgemm call. Do I understand correctly that these calls perform array allocation in GPU global memory and are therefore negligible for the total execution time, even for 32x32 matrices?
I.e., is it correct to compare GPU vs. CPU directly by comparing the testing_dgemm execution time against the usual dgemm execution time?
To be more exact: instead of execution time I use the testing_dgemm GFLOPS value, etc.
Stan Tomov
Re: time measurements in MAGMA
Quote: It looks like magma_dgemm itself includes host-device data exchanges (right?).

No. We measure the time for dgemm on the GPU, i.e., we assume the data and the result are in GPU memory.

Quote: There are also cublasSetMatrix calls before the magma_dgemm call. Do I understand correctly that these calls perform array allocation in GPU global memory and are therefore negligible for the total execution time, even for 32x32 matrices?

This call is not allocating memory. The memory allocation happens earlier. This call only sets the matrix values in GPU memory by copying them from CPU memory. The transfer of a 32x32 matrix will take a significant fraction of the magma_dgemm execution time.

Quote: Is it correct to compare GPU vs. CPU directly by comparing the testing_dgemm execution time against the usual dgemm execution time?

It will depend on what you need to accelerate. If you have the matrix on the CPU, want the result on the CPU as well, and want to check whether you can accelerate this using a GPU, you must modify the testing_dgemm code to include the memory transfers. The current MAGMA GEMM is an optimized implementation of DGEMM for the GPU where the inputs and the output are on the GPU. A CPU-interface GEMM must be hybrid, taking into account transfer times and the CPU and GPU computational power; see, e.g.,
Massimiliano Fatica. 2009. Accelerating linpack with CUDA on heterogenous clusters. In Proceedings of 2nd Workshop on General Purpose Processing on Graphics Processing Units (GPGPU-2). ACM, New York, NY, USA, 46-51.