This article looks at why replacing a 32-bit loop counter with a 64-bit one introduces crazy performance deviations with _mm_popcnt_u64 on Intel CPUs, and what can be done about it. The original question and its accepted answer follow.

Problem Description

I was looking for the fastest way to popcount large arrays of data. I encountered a very weird effect: Changing the loop variable from unsigned to uint64_t made the performance drop by 50% on my PC.

#include <iostream>
#include <chrono>
#include <cstdint>      // uint64_t
#include <cstdlib>      // atol, rand
#include <x86intrin.h>  // _mm_popcnt_u64

int main(int argc, char* argv[]) {

    using namespace std;
    if (argc != 2) {
       cerr << "usage: array_size in MB" << endl;
       return -1;
    }

    uint64_t size = atol(argv[1])<<20;
    uint64_t* buffer = new uint64_t[size/8];
    char* charbuffer = reinterpret_cast<char*>(buffer);
    for (unsigned i=0; i<size; ++i)
        charbuffer[i] = rand()%256;

    uint64_t count,duration;
    chrono::time_point<chrono::system_clock> startP,endP;
    {
        startP = chrono::system_clock::now();
        count = 0;
        for( unsigned k = 0; k < 10000; k++){
            // Tight unrolled loop with unsigned
            for (unsigned i=0; i<size/8; i+=4) {
                count += _mm_popcnt_u64(buffer[i]);
                count += _mm_popcnt_u64(buffer[i+1]);
                count += _mm_popcnt_u64(buffer[i+2]);
                count += _mm_popcnt_u64(buffer[i+3]);
            }
        }
        endP = chrono::system_clock::now();
        duration = chrono::duration_cast<std::chrono::nanoseconds>(endP-startP).count();
        cout << "unsigned\t" << count << '\t' << (duration/1.0E9) << " sec \t"
             << (10000.0*size)/(duration) << " GB/s" << endl;
    }
    {
        startP = chrono::system_clock::now();
        count=0;
        for( unsigned k = 0; k < 10000; k++){
            // Tight unrolled loop with uint64_t
            for (uint64_t i=0;i<size/8;i+=4) {
                count += _mm_popcnt_u64(buffer[i]);
                count += _mm_popcnt_u64(buffer[i+1]);
                count += _mm_popcnt_u64(buffer[i+2]);
                count += _mm_popcnt_u64(buffer[i+3]);
            }
        }
        endP = chrono::system_clock::now();
        duration = chrono::duration_cast<std::chrono::nanoseconds>(endP-startP).count();
        cout << "uint64_t\t"  << count << '\t' << (duration/1.0E9) << " sec \t"
             << (10000.0*size)/(duration) << " GB/s" << endl;
    }

    delete[] buffer;   // allocated with new[], so release with delete[]
}

As you see, we create a buffer of random data, with the size being x megabytes, where x is read from the command line. Afterwards, we iterate over the buffer and use an unrolled version of the x86 popcount intrinsic to perform the popcount. To get a more precise result, we do the popcount 10,000 times and measure the total time. In the first version, the inner loop variable is unsigned; in the second, it is uint64_t. I thought this should make no difference, but the opposite is the case.

I compile it like this (g++ version: Ubuntu 4.8.2-19ubuntu1):

g++ -O3 -march=native -std=c++11 test.cpp -o test

Here are the results on my Haswell Core i7-4770K CPU @ 3.50 GHz, running test 1 (so 1 MB random data):

  • unsigned 41959360000 0.401554 sec 26.113 GB/s
  • uint64_t 41959360000 0.759822 sec 13.8003 GB/s

As you see, the throughput of the uint64_t version is only half that of the unsigned version! The problem seems to be that different assembly gets generated, but why? First, I suspected a compiler bug, so I tried clang++ (Ubuntu Clang version 3.4-1ubuntu3):

clang++ -O3 -march=native -std=c++11 test.cpp -o test

Results for test 1:

  • unsigned 41959360000 0.398293 sec 26.3267 GB/s
  • uint64_t 41959360000 0.680954 sec 15.3986 GB/s

So, it is almost the same result and is still strange. But now it gets super strange. I replace the buffer size that was read from input with a constant 1, so I change:

uint64_t size = atol(argv[1]) << 20;

to

uint64_t size = 1 << 20;

Thus, the compiler now knows the buffer size at compile time. Maybe it can add some optimizations! Here are the numbers for g++:

  • unsigned 41959360000 0.509156 sec 20.5944 GB/s
  • uint64_t 41959360000 0.508673 sec 20.6139 GB/s

Now, both versions are equally fast. However, the unsigned version got even slower! It dropped from 26 to 20 GB/s, so replacing a non-constant by a constant value led to a deoptimization. Seriously, I have no clue what is going on here! But now on to clang++ with the new version:

  • unsigned 41959360000 0.677009 sec 15.4884 GB/s
  • uint64_t 41959360000 0.676909 sec 15.4906 GB/s

Wait, what? Now, both versions dropped to the slow 15 GB/s number. Thus, for Clang, replacing a non-constant by a constant value led to slower code in both cases!

I asked a colleague with an Ivy Bridge CPU to compile my benchmark. He got similar results, so the issue does not seem to be specific to Haswell. Because two compilers produce strange results here, it also does not seem to be a compiler bug. We do not have an AMD CPU here, so we could only test on Intel.

Take the first example (the one with atol(argv[1])) and put a static before the variable, i.e.:

static uint64_t size=atol(argv[1])<<20;

Here are my results in g++:

  • unsigned 41959360000 0.396728 sec 26.4306 GB/s
  • uint64_t 41959360000 0.509484 sec 20.5811 GB/s

Yay, yet another alternative. We still get the fast 26 GB/s with u32, but we managed to lift u64 at least from 13 GB/s up to 20 GB/s! On my colleague's PC, the u64 version became even faster than the u32 version, yielding the fastest result of all. Sadly, this only works for g++; clang++ does not seem to care about static.

Can you explain these results? Especially:

  • How can there be such a difference between u32 and u64?
  • How can replacing a non-constant by a constant buffer size trigger less optimal code?
  • How can the insertion of the static keyword make the u64 loop faster? Even faster than the original code on my colleague's computer!

I know that optimization is tricky territory; however, I never thought that such small changes could lead to a 100% difference in execution time, or that small factors like a constant buffer size could mix up the results completely again. Of course, I always want to have the version that popcounts at 26 GB/s. The only reliable way I can think of is to copy-paste the assembly for this case and use inline assembly. This is the only way I can get rid of compilers that seem to go mad on small changes. What do you think? Is there another way to reliably get the code with the most performance?
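
For comparison, one portable route that avoids inline assembly is to keep several independent accumulators, so that no single register carries every popcnt result. The sketch below is an editorial addition rather than part of the original question, and it assumes a GCC/Clang-style compiler for __builtin_popcountll:

#include <cstdint>
#include <cstddef>

// Sketch: four independent accumulators keep each dependency chain short,
// so a false dependency on any single destination register cannot
// serialize the whole loop.
uint64_t popcount_multi(const uint64_t* buf, std::size_t n) {
    uint64_t c0 = 0, c1 = 0, c2 = 0, c3 = 0;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        c0 += __builtin_popcountll(buf[i]);
        c1 += __builtin_popcountll(buf[i + 1]);
        c2 += __builtin_popcountll(buf[i + 2]);
        c3 += __builtin_popcountll(buf[i + 3]);
    }
    for (; i < n; ++i)            // leftover elements
        c0 += __builtin_popcountll(buf[i]);
    return c0 + c1 + c2 + c3;
}

Whether this actually reaches the 26 GB/s figure still depends on which registers the compiler picks, which is exactly what the answer below explains.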

Here is the disassembly for the various results:
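
These listings come straight from a disassembler; for anyone reproducing them, standard binutils tooling is enough (the addresses will of course differ per build):

g++ -O3 -march=native -std=c++11 test.cpp -o test
objdump -d ./test     # disassemble the binary and locate the popcnt loops in main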

26 GB/s version from g++ / u32 / non-const bufsize:

0x400af8:
lea 0x1(%rdx),%eax
popcnt (%rbx,%rax,8),%r9
lea 0x2(%rdx),%edi
popcnt (%rbx,%rcx,8),%rax
lea 0x3(%rdx),%esi
add %r9,%rax
popcnt (%rbx,%rdi,8),%rcx
add $0x4,%edx
add %rcx,%rax
popcnt (%rbx,%rsi,8),%rcx
add %rcx,%rax
mov %edx,%ecx
add %rax,%r14
cmp %rbp,%rcx
jb 0x400af8

13 GB/s version from g++ / u64 / non-const bufsize:

0x400c00:
popcnt 0x8(%rbx,%rdx,8),%rcx
popcnt (%rbx,%rdx,8),%rax
add %rcx,%rax
popcnt 0x10(%rbx,%rdx,8),%rcx
add %rcx,%rax
popcnt 0x18(%rbx,%rdx,8),%rcx
add $0x4,%rdx
add %rcx,%rax
add %rax,%r12
cmp %rbp,%rdx
jb 0x400c00

15 GB/s version from clang++ / u64 / non-const bufsize:

0x400e50:
popcnt (%r15,%rcx,8),%rdx
add %rbx,%rdx
popcnt 0x8(%r15,%rcx,8),%rsi
add %rdx,%rsi
popcnt 0x10(%r15,%rcx,8),%rdx
add %rsi,%rdx
popcnt 0x18(%r15,%rcx,8),%rbx
add %rdx,%rbx
add $0x4,%rcx
cmp %rbp,%rcx
jb 0x400e50

20 GB/s version from g++ / u32&u64 / const bufsize:

0x400a68:
popcnt (%rbx,%rdx,1),%rax
popcnt 0x8(%rbx,%rdx,1),%rcx
add %rax,%rcx
popcnt 0x10(%rbx,%rdx,1),%rax
add %rax,%rcx
popcnt 0x18(%rbx,%rdx,1),%rsi
add $0x20,%rdx
add %rsi,%rcx
add %rcx,%rbp
cmp $0x100000,%rdx
jne 0x400a68

15 GB/s version from clang++ / u32&u64 / const bufsize:

0x400dd0:
popcnt (%r14,%rcx,8),%rdx
add %rbx,%rdx
popcnt 0x8(%r14,%rcx,8),%rsi
add %rdx,%rsi
popcnt 0x10(%r14,%rcx,8),%rdx
add %rsi,%rdx
popcnt 0x18(%r14,%rcx,8),%rbx
add %rdx,%rbx
add $0x4,%rcx
cmp $0x20000,%rcx
jb 0x400dd0

Interestingly, the fastest (26 GB/s) version is also the longest! It seems to be the only solution that uses lea. Some versions use jb to jump, others use jne. But apart from that, all versions seem comparable. I don't see where a 100% performance gap could originate from, but I am not too adept at deciphering assembly. The slowest (13 GB/s) version even looks very short and clean. Can anyone explain this?

No matter what the answer to this question will be; I have learned that in really hot loops every detail can matter, even details that do not seem to have any association to the hot code. I have never thought about what type to use for a loop variable, but as you see such a minor change can make a 100% difference! Even the storage type of a buffer can make a huge difference, as we saw with the insertion of the static keyword in front of the size variable! In the future, I will always test various alternatives on various compilers when writing really tight and hot loops that are crucial for system performance.

The interesting thing is also that the performance difference is still so high although I have already unrolled the loop four times. So even if you unroll, you can still get hit by major performance deviations. Quite interesting.

Recommended Answer

Culprit: False Data Dependency (and the compiler isn't even aware of it)

On Sandy/Ivy Bridge and Haswell processors, the instruction:

popcnt  src, dest

appears to have a false dependency on the destination register dest. Even though the instruction only writes to it, the instruction will wait until dest is ready before executing. This false dependency is (now) documented by Intel as erratum HSD146 (Haswell) and SKL029 (Skylake).

Skylake fixed this for lzcnt and tzcnt.
Cannon Lake (and Ice Lake) fixed this for popcnt.
bsf/bsr have a true output dependency: output unmodified for input=0. (But no way to take advantage of that with intrinsics - only AMD documents it and compilers don't expose it.)
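
(As an aside that is not part of the original answer: with GNU inline asm one can approximate that guarantee by declaring the destination read-write, so a preloaded fallback value survives a zero input. This is a sketch leaning on behaviour that AMD documents and Intel hardware merely exhibits, so treat it as non-portable:)

#include <cstdint>

// Hypothetical helper: index of the lowest set bit, or `fallback` when
// mask == 0, relying on bsf leaving the destination unmodified for input 0.
static inline uint64_t bsf_or(uint64_t mask, uint64_t fallback) {
    uint64_t result = fallback;
    __asm__("bsf %1, %0"
            : "+r"(result)   // read-write: the old value survives if bsf does not write
            : "r"(mask)
            : "cc");
    return result;
}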

(Yes, these instructions all run on the same execution unit).

This dependency doesn't just hold up the 4 popcnts from a single loop iteration. It can carry across loop iterations, making it impossible for the processor to parallelize different loop iterations.
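
To make that concrete, here is the 13 GB/s loop again with the false dependencies annotated (the annotations are added here for illustration, not disassembler output):

popcnt 0x8(%rbx,%rdx,8),%rcx    # false dep: waits for the previous iteration's last write to rcx
popcnt (%rbx,%rdx,8),%rax       # false dep: waits for the previous iteration's last add into rax
add %rcx,%rax                   # true dep on both popcnts above
popcnt 0x10(%rbx,%rdx,8),%rcx   # false dep on the popcnt into rcx above
add %rcx,%rax
popcnt 0x18(%rbx,%rdx,8),%rcx   # false dep on the popcnt into rcx above
...                             # rcx and rax then carry into the next iteration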

The unsigned vs. uint64_t and other tweaks don't directly affect the problem. But they influence the register allocator which assigns the registers to the variables.

In your case, the speeds are a direct result of what is stuck to the (false) dependency chain depending on what the register allocator decided to do.

  • 13 GB/s has a chain: popcnt-add-popcnt-popcnt → next iteration
  • 15 GB/s has a chain: popcnt-add-popcnt-add → next iteration
  • 20 GB/s has a chain: popcnt-popcnt → next iteration
  • 26 GB/s has a chain: popcnt-popcnt → next iteration

The difference between 20 GB/s and 26 GB/s seems to be a minor artifact of the indirect addressing. Either way, the processor starts to hit other bottlenecks once you reach this speed.

To test this, I used inline assembly to bypass the compiler and get exactly the assembly I want. I also split up the count variable to break all other dependencies that might mess with the benchmarks.

Here are the results:

Sandy Bridge Xeon @ 3.5 GHz: (full test code can be found at the bottom)

  • GCC 4.6.3:g++ popcnt.cpp -std=c++0x -O3 -save-temps -march=native
  • Ubuntu 12

Different Registers: 18.6195 GB/s

.L4:
    movq    (%rbx,%rax,8), %r8
    movq    8(%rbx,%rax,8), %r9
    movq    16(%rbx,%rax,8), %r10
    movq    24(%rbx,%rax,8), %r11
    addq    $4, %rax

    popcnt %r8, %r8
    add    %r8, %rdx
    popcnt %r9, %r9
    add    %r9, %rcx
    popcnt %r10, %r10
    add    %r10, %rdi
    popcnt %r11, %r11
    add    %r11, %rsi

    cmpq    $131072, %rax
    jne .L4

Same Register: 8.49272 GB/s

.L9:
    movq    (%rbx,%rdx,8), %r9
    movq    8(%rbx,%rdx,8), %r10
    movq    16(%rbx,%rdx,8), %r11
    movq    24(%rbx,%rdx,8), %rbp
    addq    $4, %rdx

    # This time reuse "rax" for all the popcnts.
    popcnt %r9, %rax
    add    %rax, %rcx
    popcnt %r10, %rax
    add    %rax, %rsi
    popcnt %r11, %rax
    add    %rax, %r8
    popcnt %rbp, %rax
    add    %rax, %rdi

    cmpq    $131072, %rdx
    jne .L9

Same Register, Broken Chain: 17.8869 GB/s

.L14:
    movq    (%rbx,%rdx,8), %r9
    movq    8(%rbx,%rdx,8), %r10
    movq    16(%rbx,%rdx,8), %r11
    movq    24(%rbx,%rdx,8), %rbp
    addq    $4, %rdx

    # Reuse "rax" for all the popcnts.
    xor    %rax, %rax    # Break the cross-iteration dependency by zeroing "rax".
    popcnt %r9, %rax
    add    %rax, %rcx
    popcnt %r10, %rax
    add    %rax, %rsi
    popcnt %r11, %rax
    add    %rax, %r8
    popcnt %rbp, %rax
    add    %rax, %rdi

    cmpq    $131072, %rdx
    jne .L14


So what went wrong with the compiler?

It seems that neither GCC nor Visual Studio are aware that popcnt has such a false dependency. Nevertheless, these false dependencies aren't uncommon. It's just a matter of whether the compiler is aware of it.

popcnt isn't exactly the most used instruction. So it's not really a surprise that a major compiler could miss something like this. There also appears to be no documentation anywhere that mentions this problem. If Intel doesn't disclose it, then nobody outside will know until someone runs into it by chance.

(Update: As of version 4.9.2, GCC is aware of this false-dependency and generates code to compensate it when optimizations are enabled. Major compilers from other vendors, including Clang, MSVC, and even Intel's own ICC are not yet aware of this microarchitectural erratum and will not emit code that compensates for it.)
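
The compensation GCC applies is the classic zeroing idiom: xor the destination with itself right before the popcnt, which the register renamer recognizes as dependency-breaking. Schematically (an illustration of the pattern, not verbatim compiler output):

xor    %eax, %eax             # zeroing idiom: the renamer drops the stale rax dependency
popcnt (%rbx,%rdx,8), %rax    # now waits only on the load, not on the old rax value
add    %rax, %rcx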

Why does the CPU have this false dependency?

We can speculate: it runs on the same execution unit as bsf / bsr which do have an output dependency. (How is POPCNT implemented in hardware?). For those instructions, Intel documents the integer result for input=0 as "undefined" (with ZF=1), but Intel hardware actually gives a stronger guarantee to avoid breaking old software: output unmodified. AMD documents this behaviour.

Presumably it was somehow inconvenient to make some uops for this execution unit dependent on the output but others not.

AMD processors do not appear to have this false dependency.

The full test code follows:

#include <iostream>
#include <chrono>
#include <cstdint>      // uint64_t
#include <cstdlib>      // rand
#include <x86intrin.h>

int main(int argc, char* argv[]) {

   using namespace std;
   uint64_t size=1<<20;

   uint64_t* buffer = new uint64_t[size/8];
   char* charbuffer=reinterpret_cast<char*>(buffer);
   for (unsigned i=0;i<size;++i) charbuffer[i]=rand()%256;

   uint64_t count,duration;
   chrono::time_point<chrono::system_clock> startP,endP;
   {
      uint64_t c0 = 0;
      uint64_t c1 = 0;
      uint64_t c2 = 0;
      uint64_t c3 = 0;
      startP = chrono::system_clock::now();
      for( unsigned k = 0; k < 10000; k++){
         for (uint64_t i=0;i<size/8;i+=4) {
            uint64_t r0 = buffer[i + 0];
            uint64_t r1 = buffer[i + 1];
            uint64_t r2 = buffer[i + 2];
            uint64_t r3 = buffer[i + 3];
            __asm__(
                "popcnt %4, %4  \n\t"
                "add %4, %0     \n\t"
                "popcnt %5, %5  \n\t"
                "add %5, %1     \n\t"
                "popcnt %6, %6  \n\t"
                "add %6, %2     \n\t"
                "popcnt %7, %7  \n\t"
                "add %7, %3     \n\t"
                : "+r" (c0), "+r" (c1), "+r" (c2), "+r" (c3)
                : "r"  (r0), "r"  (r1), "r"  (r2), "r"  (r3)
            );
         }
      }
      count = c0 + c1 + c2 + c3;
      endP = chrono::system_clock::now();
      duration=chrono::duration_cast<std::chrono::nanoseconds>(endP-startP).count();
      cout << "No Chain\t" << count << '\t' << (duration/1.0E9) << " sec \t"
            << (10000.0*size)/(duration) << " GB/s" << endl;
   }
   {
      uint64_t c0 = 0;
      uint64_t c1 = 0;
      uint64_t c2 = 0;
      uint64_t c3 = 0;
      startP = chrono::system_clock::now();
      for( unsigned k = 0; k < 10000; k++){
         for (uint64_t i=0;i<size/8;i+=4) {
            uint64_t r0 = buffer[i + 0];
            uint64_t r1 = buffer[i + 1];
            uint64_t r2 = buffer[i + 2];
            uint64_t r3 = buffer[i + 3];
            __asm__(
                "popcnt %4, %%rax   \n\t"
                "add %%rax, %0      \n\t"
                "popcnt %5, %%rax   \n\t"
                "add %%rax, %1      \n\t"
                "popcnt %6, %%rax   \n\t"
                "add %%rax, %2      \n\t"
                "popcnt %7, %%rax   \n\t"
                "add %%rax, %3      \n\t"
                : "+r" (c0), "+r" (c1), "+r" (c2), "+r" (c3)
                : "r"  (r0), "r"  (r1), "r"  (r2), "r"  (r3)
                : "rax"
            );
         }
      }
      count = c0 + c1 + c2 + c3;
      endP = chrono::system_clock::now();
      duration=chrono::duration_cast<std::chrono::nanoseconds>(endP-startP).count();
      cout << "Chain 4   \t"  << count << '\t' << (duration/1.0E9) << " sec \t"
            << (10000.0*size)/(duration) << " GB/s" << endl;
   }
   {
      uint64_t c0 = 0;
      uint64_t c1 = 0;
      uint64_t c2 = 0;
      uint64_t c3 = 0;
      startP = chrono::system_clock::now();
      for( unsigned k = 0; k < 10000; k++){
         for (uint64_t i=0;i<size/8;i+=4) {
            uint64_t r0 = buffer[i + 0];
            uint64_t r1 = buffer[i + 1];
            uint64_t r2 = buffer[i + 2];
            uint64_t r3 = buffer[i + 3];
            __asm__(
                "xor %%rax, %%rax   \n\t"   // <--- Break the chain.
                "popcnt %4, %%rax   \n\t"
                "add %%rax, %0      \n\t"
                "popcnt %5, %%rax   \n\t"
                "add %%rax, %1      \n\t"
                "popcnt %6, %%rax   \n\t"
                "add %%rax, %2      \n\t"
                "popcnt %7, %%rax   \n\t"
                "add %%rax, %3      \n\t"
                : "+r" (c0), "+r" (c1), "+r" (c2), "+r" (c3)
                : "r"  (r0), "r"  (r1), "r"  (r2), "r"  (r3)
                : "rax"
            );
         }
      }
      count = c0 + c1 + c2 + c3;
      endP = chrono::system_clock::now();
      duration=chrono::duration_cast<std::chrono::nanoseconds>(endP-startP).count();
      cout << "Broken Chain\t"  << count << '\t' << (duration/1.0E9) << " sec \t"
            << (10000.0*size)/(duration) << " GB/s" << endl;
   }

   delete[] buffer;   // allocated with new[], so release with delete[]
}


An equally interesting benchmark can be found here: http://pastebin.com/kbzgL8si
This benchmark varies the number of popcnts that are in the (false) dependency chain.

False Chain 0:  41959360000 0.57748 sec     18.1578 GB/s
False Chain 1:  41959360000 0.585398 sec    17.9122 GB/s
False Chain 2:  41959360000 0.645483 sec    16.2448 GB/s
False Chain 3:  41959360000 0.929718 sec    11.2784 GB/s
False Chain 4:  41959360000 1.23572 sec     8.48557 GB/s
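
As a rough reconstruction (assuming the pastebin benchmark works along these lines, which the numbers above suggest), a false chain of length 2 can be built by letting two popcnts share one destination register before breaking it with xor:

#include <cstdint>

// Sketch of a "False Chain 2": the second popcnt false-depends on the
// first because both write %rax; the leading xor breaks the previous chain.
static inline void popcnt_chain2(uint64_t r0, uint64_t r1,
                                 uint64_t& c0, uint64_t& c1) {
    __asm__(
        "xor %%rax, %%rax  \n\t"   // dependency-breaking zeroing idiom
        "popcnt %2, %%rax  \n\t"
        "add %%rax, %0     \n\t"
        "popcnt %3, %%rax  \n\t"   // false dependency on the popcnt above
        "add %%rax, %1     \n\t"
        : "+r"(c0), "+r"(c1)
        : "r"(r0), "r"(r1)
        : "rax", "cc");
}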
