MAGMA version 0.1 Released
MAGMA version 0.1 for 32- and 64-bit Linux is now available.
See Software section for download link:
http://icl.cs.utk.edu/magma/software/
For more information visit the MAGMA web site:
http://icl.cs.utk.edu/magma/
Please use this forum for questions and comments regarding MAGMA.
Best regards,
Re: MAGMA version 0.1 Released
As you undoubtedly know, much scientific and technical work is done with complex quantities. Hence, complex versions of the codes that you have just released will be very welcome.
Thanks for the good work.
Malcolm
Re: MAGMA version 0.1 Released
Thanks for bringing this up. Complex versions are high on our priority list. We actually have them implemented at the "high" level, like the other versions (we generate the different precisions almost automatically), but we do not yet have the complex CUDA BLAS that is needed, e.g. complex versions of syrk, trmm, and trsm. We are checking with NVIDIA on this, and are considering a MAGMA implementation as well.
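For readers unfamiliar with these routines: trsm solves a triangular system with multiple right-hand sides, and its complex variant differs from the real one only in the scalar type. Below is a pure-Python sketch of the semantics only (this is not MAGMA or CUBLAS code; a real CUDA BLAS kernel would tile and parallelize the loops):

```python
# Sketch of what a complex trsm computes: solve L * X = B for X,
# where L is lower triangular with complex entries.

def ctrsm_lower(L, B):
    """Forward substitution, one right-hand-side column at a time."""
    n, m = len(L), len(B[0])
    X = [[0j] * m for _ in range(n)]
    for j in range(m):                  # each column of B
        for i in range(n):
            s = B[i][j] - sum(L[i][k] * X[k][j] for k in range(i))
            X[i][j] = s / L[i][i]
    return X

if __name__ == "__main__":
    L = [[2 + 1j, 0, 0],
         [1 - 1j, 3 + 0j, 0],
         [0 + 2j, 1 + 1j, 1 - 2j]]
    B = [[1 + 0j], [0 + 1j], [2 - 1j]]
    X = ctrsm_lower(L, B)
    # check that L * X reproduces B
    for i in range(3):
        r = sum(L[i][k] * X[k][0] for k in range(3))
        assert abs(r - B[i][0]) < 1e-12
    print("ok")
```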
Does anybody in the community already have these routines and would be willing to contribute them to the MAGMA project?
Regards,
Stan Tomov
Re: MAGMA version 0.1 Released
Thank you for the hard work on this library.
I've just built MAGMA and run some tests on a GTX 285. The functions in the GPU interface clearly deliver better GPU performance than their counterparts in the CPU interface (e.g., sgetrf vs. sgetrf_gpu). Is that because the time to transfer data between CPU memory and GPU memory is larger in the CPU interface than in the GPU interface?
Also, when MAGMA computes on the CPU and GPU at the same time, can you explain how you divide the data between them?
Thanks
Nguyen
For reference, below is my test result of sgetrf and sgetrf_gpu:
Code:
./testing_sgetrf
device 0: GeForce GTX 285, 1476.0 MHz clock, 1023.3 MB memory
device 1: GeForce GTX 285, 1476.0 MHz clock, 1023.8 MB memory
Usage:
testing_sgetrf -N 1024
N CPU GFlop/s GPU GFlop/s ||PA-LU|| / (||A||*N)
==========================================================
1024 13.89 43.79 2.049213e-09
2048 26.16 101.55 1.924833e-09
3072 30.91 158.78 1.918028e-09
4032 37.87 210.94 1.860973e-09
5184 42.00 240.96 1.840821e-09
6016 45.24 261.49 1.836232e-09
7040 47.89 280.14 1.826794e-09
8064 50.43 294.79 1.931242e-09
9088 52.48 305.72 2.152791e-09
10112 54.43 315.43 2.341549e-09
./testing_sgetrf_gpu
device 0: GeForce GTX 285, 1476.0 MHz clock, 1023.3 MB memory
device 1: GeForce GTX 285, 1476.0 MHz clock, 1023.8 MB memory
Usage:
testing_sgetrf_gpu -N 1024
N CPU GFlop/s GPU GFlop/s ||PA-LU|| / (||A||*N)
==========================================================
1024 13.31 48.41 2.049213e-09
2048 25.94 114.76 1.924833e-09
3072 30.91 179.62 1.918028e-09
4032 38.08 233.55 1.860973e-09
5184 42.56 270.25 1.840821e-09
6016 45.39 290.48 1.836232e-09
7040 48.21 306.81 1.826794e-09
8064 50.67 319.91 1.931242e-09
9088 52.80 330.09 2.152791e-09
10112 54.64 338.68 2.341549e-09
Re: MAGMA version 0.1 Released
Thanks for trying out MAGMA and for your input. The GTX 285 results look impressive!
Briefly, yes. Since most of the computation is done on the GPU, the matrix to be factored has to mostly reside in GPU memory in order to minimize communication. In the CPU interface the matrix starts on the CPU and the result is expected back on the CPU, so an overhead of copying the original matrix to the GPU and bringing the result back is to be expected. For some algorithms, QR for example, we can better interleave computation and communication and hide some of this overhead.
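A back-of-the-envelope model shows why this copy overhead matters less as N grows: sgetrf does O(N^3) flops while the extra transfers move only O(N^2) data. The bandwidth and kernel rate below are assumed round numbers for illustration, not measurements of the GTX 285:

```python
# Rough model of the CPU-interface overhead for sgetrf: ~(2/3)*N^3 flops,
# plus moving the N*N single-precision matrix to the GPU and back.
# Both constants are ASSUMPTIONS, chosen only to show the trend.

PCIE_BYTES_PER_S = 5e9    # assumed effective PCIe transfer rate
GPU_GFLOPS = 340.0        # assumed asymptotic sgetrf_gpu rate

def effective_gflops(n):
    flops = (2.0 / 3.0) * n ** 3
    compute_s = flops / (GPU_GFLOPS * 1e9)
    transfer_s = 2 * n * n * 4 / PCIE_BYTES_PER_S  # A to GPU, result back
    return flops / (compute_s + transfer_s) / 1e9

for n in (1024, 4032, 10112):
    print(n, round(effective_gflops(n), 1))
```

The O(N^2) transfer cost is amortized by the O(N^3) compute cost, so the gap between the two interfaces shrinks (relatively) at large N, which matches the trend in the tables above.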
There are of course variations for the different algorithms, but in general, if we look at Figure 4 and the notation there, the panel A1 has to be factored and A2 updated. For the one-sided factorizations currently in MAGMA, no data beyond A1 is needed in order to factor it, so A1 is sent to the CPU and factored there. This is overlapped with updating A2 (from previous iterations) on the GPU. More on this can be found in
Tomov, S., Dongarra, J., Baboulin, M. Towards Dense Linear Algebra for Hybrid GPU Accelerated Manycore Systems, LAPACK Working Note 210, October 17, 2008.
for the one-sided factorizations and in
Tomov, S., Dongarra, J. Accelerating the reduction to upper Hessenberg form through hybrid GPU-based computing, LAPACK Working Note 219, May 24, 2009.
for the two-sided.
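The panel/update split can be sketched in plain Python. This shows only the algorithmic shape: there is no pivoting (so the test matrix is made diagonally dominant), and the "CPU" and "GPU" steps run sequentially here rather than overlapped as in MAGMA:

```python
# Right-looking blocked LU without pivoting, in place. The panel
# factorization is the step MAGMA sends to the CPU; the trsm and gemm
# updates that follow are the steps MAGMA runs on the GPU.
import copy

def lu_blocked(A, nb):
    n = len(A)
    for k in range(0, n, nb):
        kb = min(nb, n - k)
        # --- panel: unblocked LU of the tall block A[k:n, k:k+kb] ("CPU")
        for j in range(k, k + kb):
            for i in range(j + 1, n):
                A[i][j] /= A[j][j]
                for c in range(j + 1, k + kb):
                    A[i][c] -= A[i][j] * A[j][c]
        # --- row block: U12 = inv(unit L11) * A12  (trsm, "GPU")
        for j in range(k, k + kb):
            for i in range(j + 1, k + kb):
                for c in range(k + kb, n):
                    A[i][c] -= A[i][j] * A[j][c]
        # --- trailing update: A22 -= L21 * U12  (gemm, "GPU")
        for i in range(k + kb, n):
            for j in range(k, k + kb):
                for c in range(k + kb, n):
                    A[i][c] -= A[i][j] * A[j][c]
    return A

def check(n=6, nb=2):
    A0 = [[1.0 / (i + j + 1) + (n if i == j else 0) for j in range(n)]
          for i in range(n)]
    A = lu_blocked(copy.deepcopy(A0), nb)
    # rebuild L*U from the in-place factors and compare with A0
    err = 0.0
    for i in range(n):
        for j in range(n):
            s = sum(A[i][k] * A[k][j] for k in range(min(i, j + 1)))
            if i <= j:
                s += A[i][j]  # unit diagonal of L
            err = max(err, abs(s - A0[i][j]))
    return err

print(check())  # residual of the reconstruction
```

In MAGMA the panel for step k+1 is factored on the CPU while the GPU is still updating the trailing matrix of step k (look-ahead), which is what hides the CPU work.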
Regards,
Stan Tomov
Re: MAGMA version 0.1 Released
Thank you very much for your detailed explanation. It will help me a lot.
Regards,
Nguyen