CUDA block and grid size efficiency

Question



What is the advised way of dealing with dynamically-sized datasets in cuda?

Is it a case of 'set the block and grid sizes based on the problem set' or is it worthwhile to assign block dimensions as factors of 2 and have some in-kernel logic to deal with the over-spill?

I can see how this probably matters a lot for the block dimensions, but how much does it matter for the grid dimensions? As I understand it, the actual hardware constraints stop at the block level (i.e. blocks are assigned to SMs that have a set number of SPs, and so can handle a particular warp size).

I've perused Kirk's 'Programming Massively Parallel Processors' but it doesn't really touch on this area.

Solution

It's usually a case of setting the block size for optimal performance, and the grid size according to the total amount of work. Most kernels have a "sweet spot" number of warps per MP where they work best, and you should do some benchmarking/profiling to see where that is. You probably still need over-spill logic in the kernel because problem sizes are rarely round multiples of block sizes.

EDIT: To give a concrete example of how this might be done for a simple kernel (in this case a custom BLAS level 1 dscal type operation done as part of a Cholesky factorization of packed symmetric band matrices):

// Fused square root and dscal operation
__global__ 
void cdivkernel(const int n, double *a)
{
    __shared__ double oneondiagv;

    int imin = threadIdx.x + blockDim.x * blockIdx.x;
    int istride = blockDim.x * gridDim.x;

    if (threadIdx.x == 0) {
        oneondiagv = rsqrt( a[0] );
    }
    __syncthreads();

    for(int i=imin; i<n; i+=istride) {
        a[i] *= oneondiagv;
    }
}

To launch this kernel, the execution parameters are calculated as follows:

  1. We allow up to 4 warps per block (so 128 threads). Normally you would fix this at an optimal number, but in this case the kernel is often called on very small vectors, so having a variable block size made some sense.
  2. We then compute the block count according to the total amount of work, up to 112 total blocks, which is the equivalent of 8 blocks per MP on a 14 MP Fermi Tesla. The kernel will iterate if the amount of work exceeds the grid size.

The resulting wrapper function containing the execution parameter calculations and kernel launch looks like this:

// Fused the diagonal element root and dscal operation into
// a single "cdiv" operation
void fusedDscal(const int n, double *a)
{
    // The semibandwidth (column length) determines
    // how many warps are required per column of the 
    // matrix.
    const int warpSize = 32;
    const int maxGridSize = 112; // this is 8 blocks per MP for a Tesla C2050

    int warpCount = (n / warpSize) + (((n % warpSize) == 0) ? 0 : 1);
    int warpPerBlock = max(1, min(4, warpCount));

    // For the cdiv kernel, the block size is allowed to grow to
    // four warps per block, and the block count becomes the warp count over four
    // or the GPU "fill" whichever is smaller
    int threadCount = warpSize * warpPerBlock;
    int blockCount = min( maxGridSize, max(1, warpCount/warpPerBlock) );
    dim3 BlockDim = dim3(threadCount, 1, 1);
    dim3 GridDim  = dim3(blockCount, 1, 1);

    cdivkernel<<< GridDim,BlockDim >>>(n,a);
    errchk( cudaPeekAtLastError() );
}

Perhaps this gives some hints about how to design a "universal" scheme for setting execution parameters against input data size.
