Gemm shared memory allocation
Posted: Thu Jan 12, 2017 4:27 pm
I'm having trouble understanding the following segment from the gemm_template_device_tn kernel (gemm_template_device.c file):
__shared__ T sA[BLK_K][BLK_M+1]; // +1 only required if A is transposed
__shared__ T sB[BLK_N][BLK_K+1]; // +1 always required
I checked that every kernel in the file contains the +1 offset. Removing it makes no difference to the results (as long as nothing indexes out of bounds), but I noticed a big performance penalty when running the modified code. Compiling the MAGMA source with nvcc and the -arch=sm_20 parameter gave the following output:
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 56 registers, 12800 bytes smem, 64 bytes cmem[0]
Whereas compiling the modified code (with the +1 offset removed), I got:
16 bytes stack frame, 12 bytes spill stores, 12 bytes spill loads
ptxas info : Used 63 registers, 12288 bytes smem, 64 bytes cmem[0]
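The 512-byte difference in smem between the two builds is what one would expect from dropping one padding element per row of sA and sB. The sketch below only checks that arithmetic. The tile sizes (BLK_M = 48, BLK_N = 48, BLK_K = 16) and the element type (double) are assumptions picked because they reproduce the reported numbers; they are not taken from the MAGMA source, and other combinations would fit as well.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes of static shared memory used by
 *   T sA[blk_k][blk_m + pad];
 *   T sB[blk_n][blk_k + pad];
 * where pad is 1 for the original code and 0 for the modified code. */
size_t smem_bytes(size_t blk_m, size_t blk_n, size_t blk_k,
                  size_t elem_size, size_t pad)
{
    size_t sa_elems = blk_k * (blk_m + pad);
    size_t sb_elems = blk_n * (blk_k + pad);
    return (sa_elems + sb_elems) * elem_size;
}

/* Checks the sizes reported by ptxas against hypothetical tile sizes.
 * These values are assumptions, chosen only because they match. */
void check_reported_sizes(void)
{
    const size_t blk_m = 48, blk_n = 48, blk_k = 16;
    const size_t elem = sizeof(double);

    assert(smem_bytes(blk_m, blk_n, blk_k, elem, 1) == 12800); /* with +1 */
    assert(smem_bytes(blk_m, blk_n, blk_k, elem, 0) == 12288); /* without */
}
```

So the padding itself accounts for the smem change; it does not by itself explain the extra registers and spills.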
Could you please give us some insight into what the +1 really does in the code?
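As background for the question: a +1 pad on the inner dimension of a shared array is commonly used to avoid shared-memory bank conflicts when threads walk down a column. Whether that is the intent here, and how it relates to the spills above, is not confirmed. The host-side sketch below just counts how many distinct banks a column access touches. It assumes 32 banks of 4-byte words and 4-byte elements; the function name is made up for illustration.

```c
#include <stddef.h>

enum { NUM_BANKS = 32 };  /* assumed: 32 banks, one 4-byte word each */

/* Counts the distinct banks touched when `threads` threads each read
 * element [t][0] of a 2-D array of 4-byte elements whose rows are
 * `row_len` elements long. Fewer distinct banks means more conflicts. */
int distinct_banks_for_column(int row_len, int threads)
{
    int seen[NUM_BANKS] = {0};
    int count = 0;

    for (int t = 0; t < threads; ++t) {
        int word = t * row_len;       /* linear word index of [t][0] */
        int bank = word % NUM_BANKS;  /* bank that holds that word */
        if (!seen[bank]) {
            seen[bank] = 1;
            ++count;
        }
    }
    return count;
}
```

Under these assumptions, distinct_banks_for_column(32, 32) is 1 (every thread hits the same bank), while distinct_banks_for_column(33, 32) is 32 (each thread gets its own bank).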
Thank you for your time!