How to get consistently high throughput in this loop?

Problem description

In the course of optimising an inner loop I have come across strange performance behaviour that I'm having trouble understanding and correcting.

A pared-down version of the code follows; roughly speaking there is one gigantic array which is divided up into 16 word chunks, and I simply add up the number of leading zeroes of the words in each chunk. (In reality I'm using the popcnt code from Dan Luu, but here I picked a simpler instruction with similar performance characteristics for "brevity". Dan Luu's code is based on an answer to this SO question which, while it has tantalisingly similar strange results, does not seem to answer my questions here.)

// -*- compile-command: "gcc -O3 -march=native -Wall -Wextra -std=c99 -o clz-timing clz-timing.c" -*-
#include <stdint.h>
#include <time.h>
#include <stdlib.h>
#include <stdio.h>

#define ARRAY_LEN 16

// Return the sum of the leading zeros of each element of the ARRAY_LEN
// words starting at u.
static inline uint64_t clz_array(const uint64_t u[ARRAY_LEN]) {
    uint64_t c0 = 0;
    for (int i = 0; i < ARRAY_LEN; ++i) {
        uint64_t t0;
        __asm__ ("lzcnt %1, %0" : "=r"(t0) : "r"(u[i]));
        c0 += t0;
    }
    return c0;
}

// For each of the narrays blocks of ARRAY_LEN words starting at
// arrays, put the result of clz_array(arrays + i*ARRAY_LEN) in
// counts[i]. Return the time taken in milliseconds.
double clz_arrays(uint32_t *counts, const uint64_t *arrays, int narrays) {
    clock_t t = clock();
    for (int i = 0; i < narrays; ++i, arrays += ARRAY_LEN)
        counts[i] = clz_array(arrays);
    t = clock() - t;
    // Convert clock time to milliseconds
    return t * 1e3 / (double)CLOCKS_PER_SEC;
}

void print_stats(double t_ms, long n, double total_MiB) {
    double t_s = t_ms / 1e3, thru = (n/1e6) / t_s, band = total_MiB / t_s;
    printf("Time: %7.2f ms, %7.2f x 1e6 clz/s, %8.1f MiB/s\n", t_ms, thru, band);
}

int main(int argc, char *argv[]) {
    long n = 1 << 20;
    if (argc > 1)
        n = atol(argv[1]);

    long total_bytes = n * ARRAY_LEN * sizeof(uint64_t);
    uint64_t *buf = malloc(total_bytes);
    uint32_t *counts = malloc(sizeof(uint32_t) * n);
    double t_ms, total_MiB = total_bytes / (double)(1 << 20);

    printf("Total size: %.1f MiB\n", total_MiB);

    // Warm up
    t_ms = clz_arrays(counts, buf, n);
    //print_stats(t_ms, n, total_MiB);    // (1)
    // Run it
    t_ms = clz_arrays(counts, buf, n);    // (2)
    print_stats(t_ms, n, total_MiB);

    // Write something into buf
    for (long i = 0; i < n*ARRAY_LEN; ++i)
        buf[i] = i;

    // And again...
    (void) clz_arrays(counts, buf, n);    // (3)
    t_ms = clz_arrays(counts, buf, n);    // (4)
    print_stats(t_ms, n, total_MiB);

    free(counts);
    free(buf);
    return 0;
}

The slightly peculiar thing about the code above is that the first and second times I call the clz_arrays function it is on uninitialised memory.

Here is the result of a typical run (compiler command is at the beginning of the source):

$ ./clz-timing 10000000
Total size: 1220.7 MiB
Time:   47.78 ms,  209.30 x 1e6 clz/s,  25548.9 MiB/s
Time:   77.41 ms,  129.19 x 1e6 clz/s,  15769.7 MiB/s

The CPU on which this was run is an "Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz", which turbos up to 3.5GHz. The latency of the lzcnt instruction is 3 cycles, but its throughput is 1 operation per cycle (see Agner Fog's Skylake instruction tables), so with 8-byte words (using uint64_t) at 3.5GHz the peak bandwidth should be 3.5e9 cycles/sec x 8 bytes/cycle = 28.0 GB/s (about 26.1 GiB/s), which is pretty close to what we see in the first number. Even at 2.6GHz we should get close to 20.8 GB/s.
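
As a sanity check, here is that peak-bandwidth arithmetic written out as a tiny standalone C program (the 3.5 GHz turbo clock and the 1 lzcnt/cycle throughput are the figures quoted above, not measured values):

#include <stdio.h>

// Back-of-the-envelope peak for the lzcnt loop: one 8-byte word consumed
// per cycle at the quoted 3.5 GHz turbo clock.
int main(void) {
    double cycles_per_sec = 3.5e9;
    double bytes_per_cycle = 8.0;                  // one uint64_t per lzcnt
    double peak = cycles_per_sec * bytes_per_cycle;
    printf("peak: %.1f GB/s = %.1f GiB/s\n",
           peak / 1e9, peak / (double)(1 << 30));
    // Prints: peak: 28.0 GB/s = 26.1 GiB/s -- the 25548.9 MiB/s of the
    // first run is within a few percent of this ceiling.
    return 0;
}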

My main question is: how can I get consistently high throughput in this loop?

Some points regarding what I've found so far:

  • According to extensive analysis with perf, the problem seems to be caused by LLC cache load misses in the slow cases that don't appear in the fast case. My guess was that maybe the fact that the memory on which we're performing the calculation hadn't been initialised meant that the compiler didn't feel obliged to load any particular values into memory, but the output of objdump -d clearly shows that the same code is being run each time. It's as though the hardware prefetcher was active the first time but not the second time, but in every case this array should be the easiest thing in the world to prefetch reliably.
  • The "warm up" calls at (1) and (3) are consistently as slow as the second printed bandwidth corresponding to call (4).
  • I've obtained much the same results on my desktop machine ("Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz").
  • Results were essentially the same between GCC 4.9, 7.0 and Clang 4.0. All tests run on Debian testing, kernel 4.14.
  • All of these results and observations can also be obtained with clz_array replaced by builtin_popcnt_unrolled_errata_manual from the Dan Luu post, mutatis mutandis.

Any help would be greatly appreciated!

Answer

Uninitialized memory that malloc gets from the kernel with mmap is all initially copy-on-write mapped to the same physical page of all zeros.

So you get TLB misses but not cache misses. If it used a 4k page, then you get L1D hits. If it used a 2M hugepage, then you only get L3 (LLC) hits, but that's still significantly better bandwidth than DRAM.
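
If you want to rule the zero page out, one option is to fault in real physical pages before the first timed run, e.g. by writing one byte per page. This is only a sketch (prefault is a made-up helper, not part of the original program), with the page size queried at run time:

#include <unistd.h>
#include <stddef.h>

// Write one byte per page so each page gets its own physical frame up
// front, instead of every page being a copy-on-write view of the shared
// zero page. Call as prefault(buf, total_bytes) right after malloc.
static void prefault(void *p, size_t bytes) {
    size_t page = (size_t)sysconf(_SC_PAGESIZE);   // usually 4096
    volatile unsigned char *c = p;
    for (size_t off = 0; off < bytes; off += page)
        c[off] = 0;
}

With the buffer pre-faulted (or simply memset before timing), the artificially fast first measurement should disappear, because every read now has to come from DRAM rather than from one cache-resident zero page.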

Single-core memory bandwidth is often limited by max_concurrency / latency, and often can't saturate DRAM bandwidth. (See Why is Skylake so much better than Broadwell-E for single-threaded memory throughput?, and the "latency-bound platforms" section of this answer, for more about this; it's much worse on many-core Xeon chips than on quad-core desktops/laptops.)
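
To make that concrete, here is a rough version of the max_concurrency / latency estimate, using ballpark Skylake-client numbers (about 10 line-fill buffers per core, 64-byte lines, somewhere around 90 ns load latency to DRAM) that are assumptions here, not measurements:

single-core demand-load ceiling ~ concurrency x line size / latency
                                ~ 10 lines x 64 B / 90 ns
                                ~ 7 GB/s

Hardware prefetch into L2 raises the effective concurrency well above the demand-miss buffers alone, which is why the measured 15769.7 MiB/s is higher than this naive figure, but it is still nowhere near the 28 GB/s the execution core could consume.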

Your first warm-up run will suffer from page faults as well as TLB misses. Also, on a kernel with Meltdown mitigation enabled, any system call will flush the whole TLB. If you added the extra print_stats call at (1) to show the warm-up run's performance, that would have made the run after it slower.

You might want to loop multiple times over the same memory inside a timing run, so you don't need so many page-walks from touching so much virtual address space.
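
A minimal sketch of that change, as a hypothetical clz_arrays_repeated dropped next to clz_arrays in the file above (repeats is a new, made-up parameter):

// Re-run the kernel over the same blocks several times inside one timed
// interval, so page faults and TLB warm-up are amortised over many passes.
double clz_arrays_repeated(uint32_t *counts, const uint64_t *arrays,
                           int narrays, int repeats) {
    clock_t t = clock();
    for (int r = 0; r < repeats; ++r) {
        const uint64_t *p = arrays;
        for (int i = 0; i < narrays; ++i, p += ARRAY_LEN)
            counts[i] = clz_array(p);
    }
    t = clock() - t;
    // Milliseconds per pass.
    return t * 1e3 / (double)CLOCKS_PER_SEC / repeats;
}

Combined with a smaller n on the command line, this keeps the virtual address range (and therefore the number of page walks) small while still giving a long enough timed interval.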

clock() is not a great way to measure performance. It records time in seconds, not CPU core clock cycles. If you run your benchmark long enough, you don't need really high precision, but you would need to control for CPU frequency to get accurate results. Calling clock() probably results in a system call, which (with Meltdown and Spectre mitigation enabled) flushes TLBs and branch-prediction. It may be slow enough for Skylake to clock back down from max turbo. You don't do any warm-up work after that, and of course you can't because anything after the first clock() is inside the timed interval.

Something based on wall-clock time which can use RDTSC as a timesource instead of switching to kernel mode (like gettimeofday()) would be lower overhead, although then you'd be measuring wall-clock time instead of CPU time. That's basically equivalent if the machine is otherwise idle so your process doesn't get descheduled.
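
For example, clock_gettime(CLOCK_MONOTONIC) is usually serviced from the vDSO on x86-64 Linux (reading the TSC in user space, no kernel transition), so it avoids a syscall per measurement; a sketch of a drop-in replacement for the clock() arithmetic in clz_arrays:

#define _POSIX_C_SOURCE 199309L   // for clock_gettime under -std=c99
#include <time.h>

// Wall-clock milliseconds; with glibc on x86-64 this is normally a vDSO
// call that reads the TSC without entering the kernel.
static double now_ms(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1e3 + ts.tv_nsec / 1e6;
}

Inside clz_arrays this would replace the clock_t bookkeeping with double t0 = now_ms(); ... return now_ms() - t0; (older glibc may need -lrt).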

For something that wasn't memory-bound, CPU performance counters to count core clock cycles can be very accurate, and without the inconvenience of having to control for CPU frequency. (Although these days you don't have to reboot to temporarily disable turbo and set the governor to performance.)
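
If you want those core-cycle counts from inside the program rather than from perf, Linux's perf_event_open interface can provide them; a minimal, Linux-only sketch (error handling trimmed, and depending on /proc/sys/kernel/perf_event_paranoid it may need extra privileges):

#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <string.h>
#include <stdint.h>
#include <stdio.h>

// Thin wrapper: glibc does not export perf_event_open directly.
static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                            int cpu, int group_fd, unsigned long flags) {
    return syscall(SYS_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main(void) {
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof attr);
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof attr;
    attr.config = PERF_COUNT_HW_CPU_CYCLES;   // unhalted core cycles
    attr.disabled = 1;
    attr.exclude_kernel = 1;
    attr.exclude_hv = 1;

    int fd = perf_event_open(&attr, 0, -1, -1, 0);   // this thread, any CPU
    if (fd < 0) { perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    /* ... the code being measured goes here ... */
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    uint64_t cycles = 0;
    read(fd, &cycles, sizeof cycles);
    printf("core cycles: %llu\n", (unsigned long long)cycles);
    close(fd);
    return 0;
}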

But for memory-bound code, changing the core frequency changes the ratio of core speed to memory speed, making memory faster or slower relative to the CPU.
