This article looks at the problem of the Intel OpenMP library significantly reducing memory bandwidth on AMD platforms when KMP_AFFINITY=scatter is set, and at how to deal with it.

Problem Description

For memory-bound programs it is not always faster to use many threads, say as many threads as cores, since threads may compete for memory channels. Usually on a two-socket machine, fewer threads are better, but we need to set an affinity policy that distributes the threads across sockets to maximize memory bandwidth.

Intel OpenMP claims that KMP_AFFINITY=scatter achieves this purpose, while the opposite value "compact" places threads as close together as possible. I have used ICC to build the Stream benchmark, and this claim is easily validated on Intel machines. Also, if KMP_AFFINITY is set, the native OpenMP environment variables like OMP_PLACES and OMP_PROC_BIND are ignored. You will get a warning such as:

        OMP: Warning #181: OMP_PROC_BIND: ignored because KMP_AFFINITY has been defined
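(For reference, the memory-bound loop in question is essentially the STREAM triad. The following is a minimal self-contained sketch of such a kernel in C/OpenMP, not the actual STREAM source; the array size and the compiler flags in the comment are illustrative choices.)

    /* Minimal triad-style kernel: a simplified sketch of the kind of
       memory-bound loop STREAM measures (not the actual STREAM code).
       N is arbitrary; it only needs to be much larger than the combined
       L3 caches. Build e.g. with: icc -qopenmp -O2 triad.c */
    #include <stdio.h>
    #include <stdlib.h>
    #include <omp.h>

    #define N (1L << 27)   /* 2^27 doubles per array, 1 GiB each */

    int main(void)
    {
        double *a = malloc(N * sizeof(double));
        double *b = malloc(N * sizeof(double));
        double *c = malloc(N * sizeof(double));
        const double scalar = 3.0;

        /* First-touch initialization: each thread touches the pages it
           will use later, so they are placed on its local NUMA node. */
        #pragma omp parallel for schedule(static)
        for (long i = 0; i < N; i++) {
            a[i] = 0.0; b[i] = 1.0; c[i] = 2.0;
        }

        double t = omp_get_wtime();
        #pragma omp parallel for schedule(static)
        for (long i = 0; i < N; i++)
            a[i] = b[i] + scalar * c[i];
        t = omp_get_wtime() - t;

        /* Triad moves 3 * N doubles (two reads, one write); printing a[N/2]
           keeps the compiler from optimizing the loop away. */
        printf("triad: %.1f GB/s (check %.1f)\n",
               3.0 * N * sizeof(double) / t / 1e9, a[N / 2]);

        free(a); free(b); free(c);
        return 0;
    }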

However, a benchmark on the newest AMD EPYC machine I obtained shows really bizarre results: KMP_AFFINITY=scatter gives the slowest possible memory bandwidth. It seems that this setting does exactly the opposite on AMD machines, placing threads as close together as possible, so that the L3 cache at each NUMA node is not even fully utilized. And if I explicitly set OMP_PROC_BIND=spread, it is ignored by Intel OpenMP, as the warning above says.

The AMD machine has two sockets with 64 physical cores per socket. I have tested with 128, 64, and 32 threads, and I want them spread across the whole system. Using OMP_PROC_BIND=spread, Stream gives me triad speeds of 225, 290, and 300 GB/s, respectively. But once I set KMP_AFFINITY=scatter, even with OMP_PROC_BIND=spread still present, Stream gives 264, 144, and 72 GB/s.

Notice that for 128 threads on 128 cores, setting KMP_AFFINITY=scatter gives the better performance; this further suggests that all the threads are in fact being placed as close together as possible, not scattered at all.

In summary, KMP_AFFINITY=scatter behaves in completely the opposite (and harmful) way on AMD machines, and it overrides the native OpenMP environment variables regardless of the CPU brand. The whole situation sounds a bit fishy, since it is well known that ICC detects the CPU brand and uses the CPU dispatcher in MKL to launch slower code paths on non-Intel machines. So why can't ICC simply disable KMP_AFFINITY and honor OMP_PROC_BIND when it detects a non-Intel CPU?

Is this a known issue? Or can someone validate my findings?

To give more context, I am a developer of a commercial computational fluid dynamics program. Unfortunately, we link our program with the ICC OpenMP library, and KMP_AFFINITY=scatter is set by default, because in CFD we must solve large-scale sparse linear systems and this part is extremely memory-bound. I found that with KMP_AFFINITY=scatter set, our program becomes 4X slower (when using 32 threads) than the speed it can actually achieve on the AMD machine.

Update:

Now, using hwloc-ps, I can confirm that KMP_AFFINITY=scatter is actually doing "compact" on my AMD Threadripper 3 machine. I have attached the lstopo result. I ran my CFD program (built with ICC 2017) with 16 threads. OMP_PROC_BIND=spread places one thread in each CCX so that the L3 caches are fully utilized. hwloc-ps -l -t gives:

With KMP_AFFINITY=scatter set instead, I got:

I will try the latest ICC/Clang OpenMP runtime and see how it works.

Recommended Answer

TL;DR: Do not use KMP_AFFINITY. It is not portable. Prefer OMP_PROC_BIND (the two cannot be used at the same time). You can combine it with OMP_PLACES to bind threads to cores manually. Additionally, numactl should be used to control memory channel binding or, more generally, NUMA effects.

Long answer:

Thread binding: OMP_PLACES can be used to bind each thread to a specific core (reducing context switches and NUMA issues). OMP_PROC_BIND and KMP_AFFINITY should theoretically do that correctly, but in practice they fail to do so on some systems. Note that OMP_PROC_BIND and KMP_AFFINITY are mutually exclusive options: they should not be used together (OMP_PROC_BIND is the newer, portable replacement for the older KMP_AFFINITY environment variable). As the core topology changes from one machine to another, you can use the hwloc tools to get the list of PU ids required by OMP_PLACES: hwloc-calc to compute the list and hwloc-ls to inspect the CPU topology. Every thread should be bound to its own place so that it cannot migrate. You can check the binding of the threads with hwloc-ps.
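(As a quick programmatic cross-check of the binding, in addition to hwloc-ps, each OpenMP thread can report the CPU it currently runs on. The minimal sketch below assumes Linux/glibc, where sched_getcpu() is available when _GNU_SOURCE is defined.)

    /* Each OpenMP thread prints the CPU it is currently running on.
       With a correct OMP_PLACES/OMP_PROC_BIND setting, the reported CPU
       should be stable for each thread and match the intended place list. */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        #pragma omp parallel
        printf("thread %d runs on CPU %d\n",
               omp_get_thread_num(), sched_getcpu());
        return 0;
    }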

NUMA effects: AMD processors are built by assembling multiple CCXs connected together by a high-bandwidth interconnect (AMD Infinity Fabric). Because of that, AMD processors are NUMA systems. If not taken into account, NUMA effects can result in a significant drop in performance. The numactl tool is designed to control/mitigate NUMA effects: processes can be bound to specific NUMA nodes (and hence their memory channels) using the --membind option, and the memory allocation policy can be set to --interleave (or --localalloc if the process is NUMA-aware). Ideally, processes/threads should only work on data allocated and first-touched on their local memory channels. If you want to test a configuration on a given CCX, you can play with --physcpubind and --cpunodebind.
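(To make the first-touch point concrete, here is a small sketch assuming the default Linux first-touch page placement policy: initializing an array serially places all of its pages on the initializing thread's NUMA node, whereas initializing it with the same static schedule as the later compute loops spreads the pages across the nodes of the threads that will actually use them.)

    #include <stdlib.h>

    #define N (1L << 27)

    /* Bad on NUMA: the calling thread first-touches every page, so all
       pages land on its local node and every other thread accesses them
       remotely through the Infinity Fabric. */
    void init_serial(double *x)
    {
        for (long i = 0; i < N; i++)
            x[i] = 0.0;
    }

    /* Better: use the same static schedule as the compute loops, so each
       thread first-touches (and thereby places locally) the pages it will
       work on later. */
    void init_first_touch(double *x)
    {
        #pragma omp parallel for schedule(static)
        for (long i = 0; i < N; i++)
            x[i] = 0.0;
    }

    int main(void)
    {
        double *x = malloc(N * sizeof(double));
        init_first_touch(x);   /* swap in init_serial(x) to see the difference */
        /* ... run the memory-bound kernels on x here ... */
        free(x);
        return 0;
    }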

My guess is that the Intel/Clang runtime does not perform a good thread binding when KMP_AFFINITY=scatter is set, because of a bad PU mapping (which could come from an OS bug, a runtime bug, or bad user/admin settings), probably related to the CCX layout (mainstream processors containing multiple NUMA nodes used to be quite rare).
On AMD processors, threads accessing the memory of another CCX usually pay a significant additional cost, because the data has to move through the (rather slow) Infinity Fabric interconnect and possibly because of its saturation, as well as that of the memory channels. I advise you not to trust the OpenMP runtime's automatic thread binding, but rather to perform the thread/memory binding manually (using OMP_PROC_BIND=TRUE together with an explicit OMP_PLACES list) and then to report bugs if needed.

Here is an example of a resulting command line for running your application:

    OMP_PROC_BIND=TRUE OMP_PLACES="{0},{1},{2},{3},{4},{5},{6},{7}" numactl --localalloc ./app

(The environment variables are set before numactl because numactl directly executes the command that follows its options.)
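(Note that the eight places listed above are only an illustration for an 8-thread run; the actual PU ids should be taken from the topology reported by hwloc-ls or computed with hwloc-calc on the target machine. With --localalloc and first-touch, pages end up on the NUMA node of the thread that touches them first, so the explicit place list is what ultimately determines which memory channels are used.)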

PS: be careful about PU/core IDs and logical/physical IDs.
