
Problem description


I am designing some kernels that I would like to be callable in two ways: once with a standard float * device pointer as the output, and once with a cudaSurfaceObject_t as the output. The kernel itself is long (>200 lines) and ultimately only the last line needs to differ: in one case you have a standard out[idx] = val assignment, while in the other you have a surf3Dwrite() call. The rest of the kernel is identical.

Something like:

__global__ void kernel(float *out, ...)
{

// 200 lines of math

// only difference, aside from the output argument
int idx = ...;
out[idx] = val;
}

vs

__global__ void kernel(cudaSurfaceObject_t out, ...)
{

// 200 lines of math

// only difference, aside from the output argument
surf3Dwrite(val, out, x * sizeof(float), y, z);
}


What is the smart way of coding this without copy-pasting the entire kernel under a new name? I checked templating, but (if I am not mistaken) it is for types only; you cannot have a completely different line of code depending on the type in a template. CUDA kernels don't seem to be overloadable either.

Recommended answer


It should be possible to overload kernels. Here is one possible approach, using overloading (and no templating):

$ cat t1648.cu
// Includes, system
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <math.h>

#include <helper_cuda.h>

__device__ float my_common(float *d, int width, unsigned int x, unsigned int y){

// 200 lines of common code...
  return d[y *width +x];
}

////////////////////////////////////////////////////////////////////////////////
// Kernels
////////////////////////////////////////////////////////////////////////////////
//! Write to a cuArray using surface writes
//! @param gIData input data in global memory
////////////////////////////////////////////////////////////////////////////////
__global__ void WriteKernel(float *gIData, int width, int height,
                                       cudaSurfaceObject_t outputSurface)
{
    // calculate surface coordinates
    unsigned int x = blockIdx.x*blockDim.x + threadIdx.x;
    unsigned int y = blockIdx.y*blockDim.y + threadIdx.y;

    // read from global memory and write to cuarray (via surface reference)
    surf2Dwrite(my_common(gIData, width, x, y),
                outputSurface, x*4, y, cudaBoundaryModeTrap);
}

__global__ void WriteKernel(float *gIData, int width, int height,
                                       float *out)
{
    // calculate coordinates
    unsigned int x = blockIdx.x*blockDim.x + threadIdx.x;
    unsigned int y = blockIdx.y*blockDim.y + threadIdx.y;

    // read from global memory and write to global memory
    out[y*width+x] = my_common(gIData, width, x, y);
}

////////////////////////////////////////////////////////////////////////////////
// Program main
////////////////////////////////////////////////////////////////////////////////
int main(int argc, char **argv)
{
    printf("starting...\n");


    unsigned width = 256;
    unsigned height = 256;
    unsigned int size = width * height * sizeof(float);

    // Allocate device memory for result
    float *dData = NULL;
    checkCudaErrors(cudaMalloc((void **) &dData, size));

    // Allocate array and copy image data
    cudaChannelFormatDesc channelDesc =
        cudaCreateChannelDesc(32, 0, 0, 0, cudaChannelFormatKindFloat);
    cudaArray *cuArray;
    float *out;
    checkCudaErrors(cudaMalloc(&out, size));
    checkCudaErrors(cudaMallocArray(&cuArray,
                                    &channelDesc,
                                    width,
                                    height,
                                    cudaArraySurfaceLoadStore));

    dim3 dimBlock(8, 8, 1);
    dim3 dimGrid(width / dimBlock.x, height / dimBlock.y, 1);

    cudaSurfaceObject_t outputSurface;
    cudaResourceDesc    surfRes;
    memset(&surfRes, 0, sizeof(cudaResourceDesc));
    surfRes.resType = cudaResourceTypeArray;
    surfRes.res.array.array = cuArray;

    checkCudaErrors(cudaCreateSurfaceObject(&outputSurface, &surfRes));
    WriteKernel<<<dimGrid, dimBlock>>>(dData, width, height, outputSurface);
    WriteKernel<<<dimGrid, dimBlock>>>(dData, width, height, out);

    checkCudaErrors(cudaDestroySurfaceObject(outputSurface));
    checkCudaErrors(cudaFree(dData));
    checkCudaErrors(cudaFree(out));
    checkCudaErrors(cudaFreeArray(cuArray));
}
$ nvcc -I/usr/local/cuda/samples/common/inc t1648.cu -o t1648
$


The above example was hacked together rapidly from the simpleSurfaceWrite CUDA sample code. It is not intended to be functional or run "correctly". It is designed to show how overloading can be used from a code structure standpoint to address the stated objective.

