Accelerating GPU Kernels for Dense Linear Algebra