What cache coherence solution do modern x86 CPUs use?

Problem description


I am somewhat confused about how cache coherence systems function in modern multi-core CPUs. I have seen that snooping-based protocols like MESIF/MOESI have been used in Intel and AMD processors; on the other hand, directory-based protocols seem to be a lot more efficient with many cores, as they don't broadcast but instead send messages to specific nodes.

What is the modern cache coherence solution in AMD or Intel processors? Is it a snooping-based protocol like MOESI or MESIF, is it only a directory-based protocol, or is it a combination of both (snooping-based protocols for communication between elements inside the same node, and directory-based for node-to-node communication)?

Solution

MESI is defined in terms of snooping a shared bus, but no, modern CPUs don't actually work that way. MESI states for each cache line can be tracked / updated with messages and a snoop filter (basically a directory) to avoid broadcasting those messages, which is what Intel (MESIF) and AMD (MOESI) actually do.
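
To make the "snoop filter is basically a directory" idea concrete, here is a toy Python sketch (all names invented for illustration; this is nothing like how real hardware is implemented): each line's MESI state lives next to a record of which cores hold the line, so an invalidation becomes a targeted message instead of a broadcast.

```python
from enum import Enum

class MESI(Enum):
    MODIFIED = "M"
    EXCLUSIVE = "E"
    SHARED = "S"
    INVALID = "I"

def send_invalidate(core: int, addr: int) -> None:
    print(f"invalidate line {addr:#x} in core #{core}")

class SnoopFilter:
    """Toy directory: per-line MESI state plus the set of cores holding the line."""
    def __init__(self) -> None:
        self.entries: dict[int, tuple[MESI, set[int]]] = {}

    def invalidate_for_write(self, addr: int, writer: int) -> None:
        # Core `writer` wants to write: message only the recorded holders,
        # instead of broadcasting an invalidate to every core.
        state, holders = self.entries.get(addr, (MESI.INVALID, set()))
        for core in holders - {writer}:
            send_invalidate(core, addr)
        self.entries[addr] = (MESI.MODIFIED, {writer})

sf = SnoopFilter()
sf.entries[0x1000] = (MESI.SHARED, {0, 1, 2})  # line shared by three cores
sf.invalidate_for_write(0x1000, writer=0)      # messages go only to cores 1 and 2
```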

e.g. the shared inclusive L3 cache in Intel CPUs (before Skylake-server) lets the L3 tags act as a snoop filter; as well as tracking the MESI state, they also record which core # (if any) has a private copy of a line. See also: Which cache mapping technique is used in intel core i7 processor?
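
As a purely hypothetical sketch of what such a tag entry might pack together (field widths and layout invented here; real Intel tag formats aren't public at this level of detail):

```python
# Hypothetical tag payload for an 8-core part: 2 MESI state bits plus an
# 8-bit "which core has a private copy" bitmap. Layout invented for
# illustration only.
STATE_BITS = {"I": 0b00, "S": 0b01, "E": 0b10, "M": 0b11}

def make_tag_entry(state: str, holder_cores: list[int]) -> int:
    core_mask = 0
    for core in holder_cores:
        core_mask |= 1 << core
    return (STATE_BITS[state] << 8) | core_mask

def cores_to_snoop(entry: int) -> list[int]:
    # The only cores worth snooping: those recorded as holding a copy.
    return [core for core in range(8) if entry & (1 << core)]

entry = make_tag_entry("E", [1])  # core #1 holds the line in Exclusive state
print(cores_to_snoop(entry))      # [1] -> snoop one core instead of all eight
```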

For example, take a Sandybridge-family CPU with a ring bus (modern client chips; server chips up to Broadwell). Core #0 reads a line that is in Modified state on core #1. (A toy sketch of this message flow follows the list below.)

  • The read misses in core #0's L1d and L2 caches, resulting in a request being sent on the ring bus to the L3 slice that contains that line (indexed via a hash function on some physical address bits)

  • That slice of L3 gets the message, checks its tags. If it found tag = Shared at this point, the response could go back over the bidirectional ring bus with the data.

  • Otherwise, L3 tags tell it that core #1 has exclusive ownership of a line: Exclusive, may have been promoted to Modified = dirty.

  • L3 cache logic in that slice of L3 will generate a message to ask core #1 to write back that line.

  • The message arrives at the ring bus stop for core #1, and gets its L2 or L1d to write back that line.

    IDK if one ring bus message can be read directly by Core #0 as well as the relevant slice of L3 cache, or if the message might have to go all the way to the L3 slice and then to core #0 from there. (Worst case distance = basically all the way around the ring, instead of half, for a bidirectional ring.)
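
Here is the flow above as a toy Python trace (the hash function, directory layout, and messages are all made up for the sketch; it models the message sequence, not the real ring protocol):

```python
NUM_SLICES = 8

def l3_slice_for(addr: int) -> int:
    # Stand-in for the undocumented hash on physical address bits.
    return (addr >> 6) % NUM_SLICES

# Toy per-line directory: addr -> (MESI state, owning core or None).
l3_directory = {0x4880: ("M", 1)}  # the line is Modified in core #1

def read_request(requester: int, addr: int) -> None:
    slice_id = l3_slice_for(addr)
    print(f"core #{requester}: L1d/L2 miss, request sent to L3 slice {slice_id}")
    state, owner = l3_directory.get(addr, ("I", None))
    if state == "S":
        print(f"slice {slice_id}: tag = Shared, data returned over the ring")
    elif state in ("E", "M") and owner is not None and owner != requester:
        print(f"slice {slice_id}: core #{owner} owns the line, asking it to write back")
        print(f"core #{owner}: L2/L1d writes the line back")
        l3_directory[addr] = ("S", None)  # both cores can now share the line
        print(f"data forwarded to core #{requester}")
    else:
        print(f"slice {slice_id}: not owned elsewhere, fetch from memory controller")

read_request(requester=0, addr=0x4880)
```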

This is super hand-wavy; don't take my word for it on the exact details, but the general concept of sending messages like share-request, RFO, or write-back is the right mental model. BeeOnRope has an answer with a similar breakdown into steps, covering uops and the store buffer as well as MESI / RFO.


In a similar case, core #1 could have silently dropped the line without having modified it, if it had only gotten Exclusive ownership but never written it. (Loads that miss in cache default to loading into Exclusive state, so a separate store won't have to do an RFO for the same line.) In that case I assume the core that doesn't have the line after all has to send a message back to indicate that. Or maybe it sends a message directly to one of the memory controllers that are also on the ring bus, instead of a round trip back to the L3 slice to force it to do that.
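
In the toy model, that silent drop means the directory can be stale: it still records core #1 as the owner, but core #1 no longer has the line. A hypothetical sketch of that exchange (how real hardware resolves it isn't documented; the reply message here is invented):

```python
core1_l2: dict[int, bytes] = {}  # core #1 silently evicted its clean Exclusive line

def writeback_request(addr: int) -> None:
    if addr in core1_l2:
        print(f"core #1: writing back {addr:#x}")
    else:
        # The line was clean (Exclusive, never written) and silently dropped,
        # so core #1 can only report that it doesn't have it.
        print(f"core #1: no copy of {addr:#x}; fetch it from memory instead")

writeback_request(0x4880)
```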

Obviously stuff like this can be happening in parallel for every core. (And each core can have multiple outstanding requests it's waiting for: memory level parallelism within a single core. On Intel, L2 superqueue has 16 entries on some microarchitectures, while there are 10 or 12 L1 LFBs.)

Quad-socket and higher systems have snoop filters between sockets; dual-socket Intel systems with E5-xxxx CPUs of Broadwell and earlier just spammed snoops to each other over the QPI links. (Unless you used a quad-socket-capable CPU (E7-xxxx) in a dual-socket system.) Multi-socket is hard because a miss in the local L3 doesn't necessarily mean it's time to hit DRAM; another socket might have the line modified.
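
A back-of-the-envelope comparison of why that matters as the socket count grows (numbers purely illustrative): broadcasting costs snoop messages on every miss, while a snoop filter pays them only when another socket actually holds the line.

```python
def snoop_messages(sockets: int, misses: int, remote_hit_rate: float,
                   snoop_filter: bool) -> float:
    # Very rough message-count estimate, purely illustrative.
    if snoop_filter:
        # The filter lookup is local; only actual holders get snooped.
        return misses * remote_hit_rate
    # Broadcast: every miss snoops every other socket.
    return misses * (sockets - 1)

print(snoop_messages(2, 1_000_000, 0.05, snoop_filter=False))  # 1000000
print(snoop_messages(4, 1_000_000, 0.05, snoop_filter=True))   # 50000.0
```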

