This article looks at the question: when calculating the GFLOPS of an Nvidia GPU card, how many threads per core are assumed? The question and accepted answer below should be a useful reference for anyone tackling the same problem.

Problem description

I am interested in obtaining the number of nanoseconds it takes to execute one double-precision FLOP on a GeForce GTX 550 Ti.

In order to do that, I am following this approach: I found out that the single-precision peak performance of the card is 691.2 GFLOPS, which would make the double-precision peak performance 1/8 of that, i.e. 86.4 GFLOPS. Then, to obtain the FLOPS per core, I divide 86.4 GFLOPS by the number of cores, 192, which gives 0.45 GFLOPS per core. 0.45 GFLOPS means 0.45 FLOPs per nanosecond per core. If this is the correct approach, I would like to know how many threads per core are run to obtain these GFLOPS numbers, and where I can find this information.
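For reference, that derivation can be written out in a few lines (a sketch of the question's own reasoning; note the 1/8 double-precision ratio is the asker's assumption, which the answer below revises):

    #include <stdio.h>

    int main(void)
    {
        /* The question's derivation, step by step. */
        double sp_peak  = 691.2;            /* GFLOPS, single precision (spec sheet) */
        double dp_peak  = sp_peak / 8.0;    /* assumed 1/8 ratio -> 86.4 GFLOPS      */
        double per_core = dp_peak / 192.0;  /* 192 CUDA cores -> 0.45 GFLOPS/core    */
        printf("%.1f GFLOPS double, %.2f GFLOPS per core\n", dp_peak, per_core);
        return 0;
    }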

Moreover, my small test shown below executes in 236000232 cycles with only one thread. To find the time (in nanoseconds) it takes to execute one iteration of the loop, I compute 236000232 / 10^6 = 236 cycles per iteration. The shader clock of the card is 1800 MHz, which means one iteration of the loop takes 236 / 1.8 = 131 nanoseconds. This number is much bigger than the one above (0.45 FLOPs per nanosecond per core). I am sure I am missing something here, because the numbers are very different. Please help me understand the math behind it.

__global__ void bench_single(float *data)
{
    int i;
    double x = 1.;                 // double-precision accumulator
    clock_t start, end;
    start = clock();               // per-SM cycle counter
    for (i = 0; i < 1000000; i++)
    {
        // One dependent multiply-add per iteration; the compiler can fuse
        // it into a single DFMA, so every iteration waits on the previous
        // result.
        x = x * 2.388415813 + 1.253314137;
    }
    end = clock();
    // Note: x is never stored, so an optimizing compiler could drop the
    // loop entirely; writing x to data[0] would guard against that.
    printf("End and start %d - %d\n", end, start);
    printf("Finished in %d cycles\n", end - start);
}
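For completeness, here is a minimal host-side driver for the kernel above (my own sketch, not part of the original post; the single-thread launch matches the test as described, and the conversion repeats the question's arithmetic):

    #include <cstdio>

    int main()
    {
        float *data;
        cudaMalloc((void **)&data, sizeof(float));

        // One block, one thread: the configuration described in the question.
        bench_single<<<1, 1>>>(data);
        cudaDeviceSynchronize();

        // 236000232 cycles / 1e6 iterations ~= 236 cycles per iteration;
        // at a 1800 MHz shader clock (1.8 cycles/ns) that is ~131 ns.
        double ns_per_iter = (236000232.0 / 1e6) / 1.8;
        printf("~%.0f ns per loop iteration\n", ns_per_iter);

        cudaFree(data);
        return 0;
    }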

Thank you,

Solution

Compute capability 2.1 devices have a double-precision throughput of 4 operations per cycle per SM (8 if doing DFMA, since each DFMA counts as two floating-point operations).

4 ops/cycle/SM * 4 SMs * 1800 MHz * 2 ops/DFMA = 57.6 GFLOPS double

The calculation assumes all 32 threads in every dispatched warp are active.

The loop in your code contains two dependent operations that the compiler can fuse into a DFMA. Use cuobjdump -sass to examine the generated assembly. With a single thread, each iteration waits on the result of the previous one, so the test measures dependent-instruction latency; if you launch multiple warps on the same SM, it turns into a measure of throughput instead.
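As an illustration of that last point (my own sketch, not from the original answer; the launch geometry is an assumed example chosen to occupy the card's 4 SMs):

    // Build and inspect the SASS for the fused DFMA (assuming nvcc and a
    // source file named bench.cu):
    //   nvcc -arch=sm_21 bench.cu -o bench
    //   cuobjdump -sass bench
    //
    // With one thread, each DFMA waits on the previous result, so the loop
    // measures dependent-instruction latency. With many warps per SM, the
    // scheduler overlaps the independent chains of different warps, and the
    // per-iteration cost approaches DFMA throughput instead.
    bench_single<<<1, 1>>>(data);     // latency-bound baseline
    cudaDeviceSynchronize();
    bench_single<<<8, 256>>>(data);   // 8 blocks x 8 warps each: throughput-bound
    cudaDeviceSynchronize();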
