Monday, June 29, 2015

Extracting matrix columns with CUDA?


Using nvprof, I found that the following kernel is the bottleneck of my CUDA application:

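// Copies the columns of src listed in indices into tgt (both column-major, numRows rows).
// len = (number of indices) * numRows is the total number of entries of tgt.
__global__ void extractColumn_kernel(real_t *tgt, const real_t *src, const int *indices, int numRows, int len) {
  int stride = gridDim.x * blockDim.x;              // total number of threads in the grid
  int tid = blockDim.x * blockIdx.x + threadIdx.x;  // global thread index
  // Grid-stride loop over the flat, column-major index j of tgt.
  for (int j = tid; j < len; j += stride) {
    int colId = j / numRows;  // column of tgt
    int rowId = j % numRows;  // row within that column
    tgt[j] = src[indices[colId]*numRows + rowId];
  }
}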

It is meant to copy the columns of the matrix src that are listed in indices into the matrix tgt. Note that src and tgt both have numRows rows and are stored in column-major order. Also, len = length(indices)*numRows is the total number of entries of tgt.
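To make the intended result concrete, the same operation can be written as a plain host-side loop. This is only a minimal reference sketch for checking the kernel's output on small inputs, not part of the application itself: the name extractColumnsRef is hypothetical, the matrices are held in std::vector for convenience, and indices is assumed to be 0-based, as in the kernel (unlike MATLAB).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side reference for tgt = src(:, indices).
// Assumptions: column-major storage, numRows rows, 0-based column indices.
template <typename T>
std::vector<T> extractColumnsRef(const std::vector<T>& src,
                                 const std::vector<int>& indices,
                                 int numRows) {
  assert(numRows > 0);
  const std::size_t rows = static_cast<std::size_t>(numRows);
  assert(src.size() % rows == 0);
  const std::size_t numSrcCols = src.size() / rows;

  std::vector<T> tgt(indices.size() * rows);
  for (std::size_t c = 0; c < indices.size(); ++c) {
    // Column c of tgt is column indices[c] of src.
    assert(indices[c] >= 0);
    const std::size_t srcCol = static_cast<std::size_t>(indices[c]);
    assert(srcCol < numSrcCols);
    for (std::size_t r = 0; r < rows; ++r) {
      tgt[c * rows + r] = src[srcCol * rows + r];
    }
  }
  return tgt;
}
```

For a 2-by-3 matrix stored as {1, 2, 3, 4, 5, 6} and indices {2, 0}, this yields the columns {5, 6} and {1, 2} in that order, which is what the kernel should produce for the same input.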

My question: is there a more efficient way to do this? References to earlier questions are also appreciated. I am surprised that I couldn't find this question asked before, since this is the very common operation tgt = src(:,indices(:)); in MATLAB.

Thanks a ton!

