This article looks at the question: when calculating the GFLOPS of an Nvidia GPU card, how many threads per core are assumed? The question and accepted answer below should be a useful reference for anyone tackling the same problem.

Problem description

I am interested in obtaining the number of nanoseconds it takes to execute one double-precision FLOP on a GeForce GTX 550 Ti.

In order to do that, I am following this approach: I found out that the single-precision peak performance of the card is 691.2 GFLOPS, which would make the double-precision peak performance 1/8 of that, i.e. 86.4 GFLOPS. Then, to obtain the FLOPS per core, I divide 86.4 GFLOPS by the number of cores, 192, which gives 0.45 GFLOPS per core. 0.45 GFLOPS means 0.45 FLOPs per nanosecond per core. If this is the correct approach, I would like to know how many threads per core are run to obtain these GFLOPS numbers, and where I can find this information.
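For reference, that derivation can be written out in a few lines (a sketch of the question's own reasoning; note the 1/8 double-precision ratio is the asker's assumption, which the answer below revises):

    #include <stdio.h>

    int main(void)
    {
        /* The question's derivation, step by step. */
        double sp_peak  = 691.2;            /* GFLOPS, single precision (spec sheet) */
        double dp_peak  = sp_peak / 8.0;    /* assumed 1/8 ratio -> 86.4 GFLOPS      */
        double per_core = dp_peak / 192.0;  /* 192 CUDA cores -> 0.45 GFLOPS/core    */
        printf("%.1f GFLOPS double, %.2f GFLOPS per core\n", dp_peak, per_core);
        return 0;
    }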

Moreover, my small test shown below executes in 236000232 cycles with only one thread. To find the time (in nanoseconds) it takes to execute one iteration of the loop, I compute 236000232 / 10^6 = 236 cycles per iteration. The shader clock of the card is 1800 MHz, which means one iteration of the loop takes 236 / 1.8 = 131 nanoseconds. This number is much bigger than the one above (0.45 FLOPs per nanosecond per core). I am sure I am missing something here, because the numbers are very different. Please help me understand the math behind it.

__global__ void bench_single(float *data)
{
    int i;
    double x = 1.;                 // double-precision accumulator
    clock_t start, end;
    start = clock();               // per-SM cycle counter
    for (i = 0; i < 1000000; i++)
    {
        // One dependent multiply-add per iteration; the compiler can fuse
        // it into a single DFMA, so every iteration waits on the previous
        // result.
        x = x * 2.388415813 + 1.253314137;
    }
    end = clock();
    // Note: x is never stored, so an optimizing compiler could drop the
    // loop entirely; writing x to data[0] would guard against that.
    printf("End and start %d - %d\n", end, start);
    printf("Finished in %d cycles\n", end - start);
}
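For completeness, here is a minimal host-side driver for the kernel above (my own sketch, not part of the original post; the single-thread launch matches the test as described, and the conversion repeats the question's arithmetic):

    #include <cstdio>

    int main()
    {
        float *data;
        cudaMalloc((void **)&data, sizeof(float));

        // One block, one thread: the configuration described in the question.
        bench_single<<<1, 1>>>(data);
        cudaDeviceSynchronize();

        // 236000232 cycles / 1e6 iterations ~= 236 cycles per iteration;
        // at a 1800 MHz shader clock (1.8 cycles/ns) that is ~131 ns.
        double ns_per_iter = (236000232.0 / 1e6) / 1.8;
        printf("~%.0f ns per loop iteration\n", ns_per_iter);

        cudaFree(data);
        return 0;
    }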

Thank you,

Solution

Compute capability 2.1 devices have a double-precision throughput of 4 operations per cycle per SM (8 if doing DFMA, since each DFMA counts as two floating-point operations).

4 ops/cycle/SM * 4 SMs * 1800 MHz * 2 ops/DFMA = 57.6 GFLOPS double

The calculation assumes all 32 threads in every dispatched warp are active.

The loop in your code contains two dependent operations that the compiler can fuse into a DFMA. Use cuobjdump -sass to examine the generated assembly. With a single thread, each iteration waits on the result of the previous one, so the test measures dependent-instruction latency; if you launch multiple warps on the same SM, it turns into a measure of throughput instead.
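As an illustration of that last point (my own sketch, not from the original answer; the launch geometry is an assumed example chosen to occupy the card's 4 SMs):

    // Build and inspect the SASS for the fused DFMA (assuming nvcc and a
    // source file named bench.cu):
    //   nvcc -arch=sm_21 bench.cu -o bench
    //   cuobjdump -sass bench
    //
    // With one thread, each DFMA waits on the previous result, so the loop
    // measures dependent-instruction latency. With many warps per SM, the
    // scheduler overlaps the independent chains of different warps, and the
    // per-iteration cost approaches DFMA throughput instead.
    bench_single<<<1, 1>>>(data);     // latency-bound baseline
    cudaDeviceSynchronize();
    bench_single<<<8, 256>>>(data);   // 8 blocks x 8 warps each: throughput-bound
    cudaDeviceSynchronize();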
