This article covers the question "Is there any way to write code for direct core-to-core communication on Intel CPUs?" together with a recommended answer; it may be a useful reference if you are facing the same problem.

Problem description

I want to pin threads to all cores in two CPU sockets, and have the threads communicate with each other without writing back to DRAM.
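For context, the kind of setup described here can be done from plain C on Linux with pthreads and C11 atomics; the sketch below is not from the original post, and the core IDs 0 and 1 are placeholders (real IDs depend on your machine's topology). It only pins two threads and passes a value through a cache-line-aligned flag; the cache-coherence hardware does the actual data movement.

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdio.h>

/* One cache line, so the flag and payload don't false-share with other data. */
struct msg {
    _Atomic int ready;
    int payload;
} __attribute__((aligned(64)));

static struct msg shared;

static void pin_to_core(int core)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

static void *writer(void *arg)
{
    (void)arg;
    pin_to_core(0);                           /* placeholder core ID */
    shared.payload = 42;
    atomic_store_explicit(&shared.ready, 1, memory_order_release);
    return NULL;
}

static void *reader(void *arg)
{
    (void)arg;
    pin_to_core(1);                           /* placeholder core ID */
    while (!atomic_load_explicit(&shared.ready, memory_order_acquire))
        ;   /* spin: the data typically arrives via a cache-to-cache transfer */
    printf("got %d\n", shared.payload);
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, writer, NULL);
    pthread_create(&t2, NULL, reader, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}

Compile with something like gcc -O2 -pthread; as the answer below explains, there is no user-visible knob to skip DRAM here, the cache hierarchy already avoids it.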

Write-back to cache would be fine for my throughput if I only used the cores in one socket, but for two sockets I wonder whether there is anything faster, such as an on-chip network or the Intel QuickPath Interconnect?

What's more, is there any easy way to exploit such a feature without writing assembly code directly?

Reference: https://software.intel.com/en-us/forums/intel-moderncode-for-parallel-architectures/topic/700477

Recommended answer

TL:DR: no. CPU hardware is already optimized for one core storing and another core loading; there is no magic high-performance, lower-latency method you can use instead. If the write side could somehow force its data to be written back to L3, that would reduce latency for the read side, but unfortunately there is no good way to do that (except on Tremont Atom, see below).

A shared last-level cache already backstops coherency traffic, avoiding writes / re-reads to DRAM.

Don't be fooled by MESI diagrams; those show single-level caches without a shared cache.

In real CPUs, stores from one core only have to write back to the last-level cache (LLC = L3 in modern x86) for loads from other cores to access them. L3 can hold dirty lines; all modern x86 CPUs have a write-back L3, not write-through.

On a modern multi-socket system, each socket has its own memory controllers (NUMA), so snooping detects when cache->cache transfers need to happen over the interconnect between sockets. But yes, pinning threads to the same physical core does improve inter-core / inter-thread latency. (Similarly for AMD Zen, where clusters of 4 cores share a chunk of LLC: which cluster a thread is in matters for inter-core latency even within a single socket, because there isn't one big LLC shared across all cores.)
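One way to see this effect (a sketch under the assumption of Linux, pthreads and C11 atomics; it is not part of the original answer) is a simple ping-pong benchmark: run it once with both core IDs on the same socket and once with the cores on different sockets, and the per-round-trip time shows the extra cost of the cross-socket cache-to-cache transfer. The core IDs come from the command line because the core/socket numbering is machine-specific.

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define ITERS 1000000ULL

/* The "ball" passed back and forth, alone on its cache line. */
static _Atomic uint64_t ball __attribute__((aligned(64)));

static void pin(int core)
{
    cpu_set_t s;
    CPU_ZERO(&s);
    CPU_SET(core, &s);
    pthread_setaffinity_np(pthread_self(), sizeof(s), &s);
}

/* Replies to every even value 2*i with the odd value 2*i+1. */
static void *pong(void *arg)
{
    pin((int)(intptr_t)arg);
    for (uint64_t i = 0; i < ITERS; i++) {
        while (atomic_load_explicit(&ball, memory_order_acquire) != 2 * i)
            ;
        atomic_store_explicit(&ball, 2 * i + 1, memory_order_release);
    }
    return NULL;
}

int main(int argc, char **argv)
{
    int core_a = argc > 1 ? atoi(argv[1]) : 0;   /* placeholder defaults */
    int core_b = argc > 2 ? atoi(argv[2]) : 1;

    pthread_t t;
    pthread_create(&t, NULL, pong, (void *)(intptr_t)core_b);
    pin(core_a);

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (uint64_t i = 0; i < ITERS; i++) {
        atomic_store_explicit(&ball, 2 * i, memory_order_release);
        while (atomic_load_explicit(&ball, memory_order_acquire) != 2 * i + 1)
            ;
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    pthread_join(t, NULL);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("cores %d <-> %d: %.1f ns per round trip\n", core_a, core_b, ns / ITERS);
    return 0;
}

Build with gcc -O2 -pthread and run it as, for example, ./a.out 0 1; exact numbers depend heavily on the CPU and on where the two cores sit.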

You can't do much better than this: a load on one core will generate a share request once it reaches L3 and finds the line is Modified in the private L1d or L2 of another core. This is why the latency is higher than an L3 hit: the load request has to get to L3 before it even knows it's not just going to be an L3 hit. But Intel uses its large shared inclusive L3 cache tags as a snoop filter, to track which core on the chip might have the line cached. (This changed in Skylake-Xeon; its L3 is no longer inclusive, not even tag-inclusive, and it must have some separate snoop filter.)

See also: Which cache mapping technique is used in the Intel Core i7 processor?

Fun fact: on Core 2 CPUs, traffic between cores really was as slow as DRAM in some cases, even for cores that shared an L2 cache.

Early Core 2 Quad CPUs were really two dual-core dies in the same package and didn't share a last-level cache. That might have been even worse; some CPUs like that didn't have a shared LLC, and IDK whether the "glue" logic could even do cache->cache transfers of dirty data without a write-back to DRAM.

But those days are long past; modern multi-core and multi-socket CPUs are about as optimized as they can be for inter-core traffic.

There is really nothing special you can do on the read side to speed things up.

If you had cldemote on the write side, or some other way to get data evicted back to L3, the read side could just get L3 hits. But that is only available on Tremont Atom.
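For reference, compilers expose this as an intrinsic, so no hand-written assembly is needed. The sketch below assumes GCC or Clang built with -mcldemote; the struct and function names are made up for illustration. The instruction is only a hint: on CPUs without the feature it executes as a NOP, and where it is supported it asks the core to push the freshly written line toward the shared L3 so a later reader can get an L3 hit.

#include <immintrin.h>   /* _cldemote; build with -mcldemote on GCC/Clang */
#include <stdatomic.h>

/* Hypothetical one-cache-line channel, just for illustration. */
struct channel {
    int payload;
    _Atomic int ready;
} __attribute__((aligned(64)));

void publish(struct channel *ch, int value)
{
    ch->payload = value;
    atomic_store_explicit(&ch->ready, 1, memory_order_release);
    _cldemote(ch);   /* hint: demote this line toward the shared LLC; a NOP on CPUs without CLDEMOTE */
}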

x86 MESI invalidate cache line latency issue is another question about trying to get the write side to evict cache lines back to L3, in that case via conflict misses.

clwb might work to reduce read-side latency, but the downside is that it forces a write-back all the way to DRAM, not just to L3. (And on Skylake-Xeon it does evict the line, like clflushopt. Hopefully Ice Lake will give us a "real" clwb.)
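For completeness, clwb is also available as an intrinsic, so again no assembly is needed; the sketch below (hypothetical names, compile with -mclwb) only illustrates the instruction being discussed. As noted above, it is usually a net loss for core-to-core latency because the write-back goes all the way to DRAM, and on Skylake-Xeon it evicts the line like clflushopt.

#include <immintrin.h>   /* _mm_clwb; build with -mclwb */
#include <stdatomic.h>

/* Hypothetical one-cache-line channel, just for illustration. */
struct channel {
    int payload;
    _Atomic int ready;
} __attribute__((aligned(64)));

void publish_clwb(struct channel *ch, int value)
{
    ch->payload = value;
    atomic_store_explicit(&ch->ready, 1, memory_order_release);
    _mm_clwb(ch);    /* write the dirty line back (all the way to DRAM, per the answer above) */
    _mm_sfence();    /* clwb is weakly ordered; fence if later code depends on the write-back */
}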

How to force cpu core to flush store buffer in c? is another question about basically the same thing.

That concludes this article on whether there is any way to write direct core-to-core communication code for Intel CPUs; hopefully the recommended answer above is helpful.
