Using sgemm for rectangular (non-square) matrix multiply

emscheffel · Post by **emscheffel** » Mon Apr 09, 2012 3:09 am

Dear All,

I am fairly new to GPU computing using CUDA but have gathered a lot of information from the internet over the last month or so. I am an academic researcher who has mostly been programming computational routines in Python, due to the simplicity of scripting languages (of course I know, runtime speed kind of sucks for more complex problems...). I have identified CUDA GPGPU (and thus potentially also MAGMA) as a great way of "outsourcing" certain hotspot code fragments to the GPU. In particular I am thinking about cases such as bootstrapping techniques in econometrics and Monte Carlo integration in Finance, where in the former case I may have to solve a linear algebra problem many thousand times over, which I want to do in parallel rather than serially.

There is one problem I have been experiencing, which is that many CUDA-optimized routines - such as SGEMM in CUBLAS - appear only to work well with square matrices. Is this the same with MAGMA SGEMM as well? All I want is a general, "fire-and-forget", but very fast GPU SGEMM version for a GENERAL matrix multiplication problem, where the matrices are [p,q] x [q,r] where p neq q neq r. I think the best option in my case would be to use PyCUDA actually, because I employ Python as my main language, but to my surprise I also still have not found a working source code for a kernel which delivers a fool-proof solution to my problem. Could anybody provide me with a hint as to what approach I ought to take? I have played around a lot with Vasily Volkov's fairly new SGEMM kernel source code, but again, for me so far this has only worked for the special case of square matrices. I am 100% certain that the alternative "solution" of applying padding of zeros to general problems to make them all square cannot be answer to what I am looking for :-)

Thanks,
Eric

mgates3 · Post by **mgates3** » Fri May 04, 2012 10:43 am

The gemm in CUBLAS (and in MAGMA BLAS) works for rectangular matrices. As far as I know, all the CUBLAS work for all matrix sizes. If you are seeing some problems, please report the specific case. A short sample code would be helpful.
-mark

Anitha · Post by **Anitha** » Thu Oct 04, 2012 2:45 am

Hello,
I did non - square matrix multiplication by using cublasSgemm, I am getting the wrong output. I m very new to use those library functions. Please any one help me.
I used "cublasSgemm('n','n',N1,M2,M1,alpha,d_a,M1,d_b,M1,beta,d_c,N1)"
int N1 = 2; // row of matrix1
int M1 = 3; // col of matrix1
int N2 = 3; // row of matrix2
int M2 = 2; // col of matrix2

Thanks in advance..

Anitha · Post by **Anitha** » Thu Oct 04, 2012 3:04 am

Hello,
I m very new to use these library files. I did non - square matrix multiplication using cublasSgemm function, but am getting the wrong ouput. Please any one help me.
My sample code :
"cublasSgemm('n','n',N1,M2,M1,alpha,d_a,M1,d_b,M1,beta,d_c,N1)"

int N1 = 2; // row of matrix1
int M1 = 3; // col of matrix1
int N2 = 3; // row of matrix2
int M2 = 2; // col of matrix2

Thanks in advance..

mgates3 · Post by **mgates3** » Thu Oct 04, 2012 1:01 pm

It will help if you use the BLAS m, n, k convention. For C := alpha A B + beta C
C is m x n
A is m x k
B is k x n

It appears your leading dimensions may be wrong. The leading dimension (lda, ldb, ldc) is the number of rows allocated for the matrix, which may be greater than the number of rows in the gemm. If the gemm is the whole matrix without padding, then lda=m, ldb=k, and ldc=m. You can of course add padding (extra unused rows on bottom), so lda >= m, ldb >= k, ldc >= m. We usually pad the number of rows to a multiple of 32 on the GPU. See magma/testing/testing_sgemm.cpp for sample code.

cublasSgemm( 'n', 'n', m, n, k, alpha, dA, lda, dB, ldb, beta, dC, ldc );

You used N for rows and M for columns, which is the opposite of the normal matrix convention (m rows x n cols). In your notation,
lda = N1;
ldb = N2; // == M1, the inner, k, dimension
ldc = N1;
cublasSgemm( 'n', 'n', N1, M2, M1, alpha, dA, lda /*N1*/, dB, ldb /*N2*/, dC, ldc /*N1*/ );
In particular, the lda you passed was M1, not N1.

Hope that helps. If you still have problems, please post a more complete sample code. I can't tell how you are allocating the matrices, putting data onto the GPU, getting results from the GPU, or what your expected and actual outputs are.

-mark

Anitha · Post by **Anitha** » Fri Oct 05, 2012 12:36 am

Thanks for your reply Sir, I made a mistake. Now i got correct output.

soccon2007 · Post by **soccon2007** » Thu Aug 15, 2019 1:37 am

Hi all, I have a below code. In my code, I want to replace cublasSgemm function by my code. But when I ran, I got a wrong results. If you know how to solve my problem, please show me. Thanks very much.

Code: Select all

void caffe_gpu_gemm<float>(const CBLAS_TRANSPOSE TransA,
    const CBLAS_TRANSPOSE TransB, const int M, const int N, const int K,
    const float alpha, const float* A, const float* B, const float beta,
    float* C) {
  // Note that cublas follows fortran order.
  int lda = (TransA == CblasNoTrans) ? K : M;
  int ldb = (TransB == CblasNoTrans) ? N : K;
  cublasOperation_t cuTransA =
      (TransA == CblasNoTrans) ? CUBLAS_OP_N : CUBLAS_OP_T;
  cublasOperation_t cuTransB =
      (TransB == CblasNoTrans) ? CUBLAS_OP_N : CUBLAS_OP_T;
 // CUBLAS_CHECK(cublasSgemm(Caffe::cublas_handle(), cuTransB, cuTransA,
  //    N, M, K, &alpha, B, ldb, A, lda, &beta, C, N));
////
cudaDeviceSynchronize();

  float* A_cpu = (float *)malloc(K*M*sizeof(float));
  float* B_cpu = (float *)malloc(N*K*sizeof (float));
  float* C_cpu = (float *)malloc (N*M*sizeof (float));
  float * D_cpu = (float *)malloc (N*M*sizeof (float));
  cudaMemcpy( A_cpu, A, K*M, cudaMemcpyDeviceToHost);
  cudaMemcpy( B_cpu, B, N*K, cudaMemcpyDeviceToHost);
  cudaMemcpy( D_cpu, C, N*M, cudaMemcpyDeviceToHost);
  
	  for (int row = 0; row < N; row++) {
	 	for (int col = 0; col < M; col++) {
	 		 for (int i = 0; i < K; i++) {
	 			if (i == 0)
	 				  C_cpu[row*M + col] =  alpha* (B_cpu[row*K + i] * A_cpu[i*M+ col]) +beta* D_cpu[row*M+col] ;
	 			else C_cpu[row*M + col] +=  alpha* (B_cpu[row*K + i] * A_cpu[i*M+ col]) +beta* D_cpu[row*M+col] ;		
        
	 		}
	 	 }
	 }  
  cudaMemcpy(C,C_cpu, N*M, cudaMemcpyHostToDevice);
 printf ("da lam xong"); 
/////
}

Diem

mgates3 · Post by **mgates3** » Fri Aug 16, 2019 7:56 am

First, in the future, please start a new topic, rather than replying to a 7 year old topic.

Your variable names seem mixed up: there are M rows and N cols in C. You are adding beta*C multiple times. Normally I use i for rows, j for cols. Unfortunately k is used for dimension, so we need something else for that loop counter. Also, use lda, ldb, ldc for matrix strides for clarity and flexibility. I recommend looking at the reference BLAS (http://www.netlib.org/lapack/) for naive, but correct, dgemm implementations, and at magmablas for more optimized, complicated implementations.

Code: Select all

// Naive, slow implementation.
// NoTrans, NoTrans case.
for (j = 0; j < n; ++j) {
    for (i = 0; i < m; ++i) {
        tmp = 0;
        for (kk = 0; kk < k; ++kk) {
            tmp += A[ i + kk*lda ] * B[ kk + j*ldb ];
        }
        C[ i + j*ldc ] = tmp + beta*C[ i + j*ldc ];
    }
}

-mark

MAGMA Forum

Using sgemm for rectangular (non-square) matrix multiply

Using sgemm for rectangular (non-square) matrix multiply

Re: Using sgemm for rectangular (non-square) matrix multiply

Re: Using sgemm for rectangular (non-square) matrix multiply

Re: Using sgemm for rectangular (non-square) matrix multiply

Re: Using sgemm for rectangular (non-square) matrix multiply

Re: Using sgemm for rectangular (non-square) matrix multiply

Re: Using sgemm for rectangular (non-square) matrix multiply

Re: Using sgemm for rectangular (non-square) matrix multiply