Do store instructions block subsequent instructions on a cache miss?

Question

Let's say we have a processor with two cores (C0 and C1) and a cache line starting at address k that is initially owned by C0. If C1 issues a store instruction to an 8-byte slot in line k, will that affect the throughput of the instructions that follow it on C1?

The Intel optimization manual has the following paragraph:

Consider the following code:

// core C1
foo();
line(k)->at(i)->store(kConstant, std::memory_order_release);
bar();
baz();

The quote from the Intel manual makes me assume that, in the code above, execution will look as if the store were essentially a no-op, and that it will not affect the latency between the end of foo() and the start of bar(). In contrast, for the following code,

// core C1
foo();
bar(line(k)->at(i)->load(std::memory_order_acquire));
baz();

the latency between the end of foo() and the start of bar() would be affected by the load, because bar() takes the result of the load as an argument and so depends on it.

This question is mostly concerned with how Intel processors (in the Broadwell family or newer) behave in the case above, and in particular with how C++ code like the above gets compiled down to assembly for those processors.
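To make the two snippets above concrete, here is a minimal compilable sketch. The names CacheLine, publish and consume, the 64-byte line size and the value of kConstant are assumptions made for illustration; they are not from the question. The comments describe what GCC and Clang typically emit on x86-64.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for the question's line(k): one 64-byte cache line
// holding eight 8-byte atomic slots.
struct alignas(64) CacheLine {
    std::array<std::atomic<std::uint64_t>, 8> slots{};
};
static_assert(sizeof(CacheLine) == 64, "expected exactly one cache line");

constexpr std::uint64_t kConstant = 42;  // assumed value, for illustration only

// Release store. On x86-64 this is typically an ordinary `mov` to memory:
// the value goes into the store buffer and later code does not wait for it.
inline void publish(CacheLine& line, std::size_t i) {
    assert(i < line.slots.size());
    line.slots[i].store(kConstant, std::memory_order_release);
}

// Acquire load. Also typically an ordinary `mov` on x86-64, but whatever
// consumes the returned value cannot execute until the data has arrived.
inline std::uint64_t consume(const CacheLine& line, std::size_t i) {
    assert(i < line.slots.size());
    return line.slots[i].load(std::memory_order_acquire);
}
```

By contrast, a store with the default std::memory_order_seq_cst typically compiles to xchg (or mov followed by mfence) on x86-64, and that does make later memory operations wait for the store buffer to drain.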

Recommended answer

Generally speaking, a store that is not soon read by subsequent code does not directly delay that subsequent code on any modern out-of-order processor, including Intel's.

For example:

foo();
*x = y;
bar();

If foo() doesn't modify x or y, and bar() doesn't load from *x, the store is independent of both calls. It may start executing before foo() is complete (or even before foo() starts), bar() may execute before the store commits to the cache, and bar() may even execute while foo() is still running.
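A minimal sketch of a sequence that satisfies those conditions is shown below. The bodies of foo() and bar() and the function name sequence are invented for illustration; the only point is that neither call touches x or y, so the store has no data dependency on either of them.

```cpp
#include <cstdint>

// Illustrative stand-ins: pure arithmetic that never touches x or y.
inline std::uint64_t foo(std::uint64_t a) { return a * 3 + 1; }
inline std::uint64_t bar(std::uint64_t b) { return b ^ (b >> 7); }

// The store needs only x and y, both already known on entry, so an
// out-of-order core is free to execute it before, during or after foo().
// bar() never reads *x, so it does not wait for the store to commit.
inline std::uint64_t sequence(std::uint64_t* x, std::uint64_t y,
                              std::uint64_t a, std::uint64_t b) {
    const std::uint64_t r1 = foo(a);
    *x = y;  // independent store
    const std::uint64_t r2 = bar(b);
    return r1 + r2;
}
```

If bar() instead read *x shortly after the store, it would depend on the stored value, and the "not soon read" condition would no longer hold.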

While there is little direct impact, that doesn't mean there are no indirect impacts, and indeed the store may dominate the execution time.

If the store misses in the cache, it may tie up off-core resources while the miss is satisfied. It also usually prevents subsequent stores from draining, which can become a bottleneck: if the store buffer fills up, the front end blocks entirely and new instructions no longer enter the scheduler.
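One way to observe this effect is a small timing loop that issues many independent stores and varies how many of them land in each cache line. The sketch below is a rough measurement aid, not a rigorous benchmark; the function name time_stores, its parameters and the 64-byte line size used in the note that follows are assumptions.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <vector>

// Issues `count` 8-byte stores into `buf`, each `stride_bytes` after the
// previous one and wrapping at the end of the buffer. Returns the elapsed
// time in nanoseconds. The volatile pointer keeps the compiler from
// merging or dropping the stores.
inline std::int64_t time_stores(std::vector<std::uint64_t>& buf,
                                std::size_t stride_bytes,
                                std::size_t count) {
    const std::size_t step = stride_bytes / sizeof(std::uint64_t);
    assert(!buf.empty() && step <= buf.size());
    volatile std::uint64_t* p = buf.data();
    std::size_t idx = 0;
    const auto t0 = std::chrono::steady_clock::now();
    for (std::size_t n = 0; n < count; ++n) {
        p[idx] = n;                               // nothing reads this back
        idx += step;
        if (idx >= buf.size()) idx -= buf.size();  // wrap without a division
    }
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count();
}
```

With a buffer much larger than the last-level cache, calling this with a stride of 64 (one store per line, so nearly every store misses) and again with a stride of 8 (eight stores per line) gives a rough feel for how much the misses cost once the store buffer can no longer hide them. The actual numbers depend heavily on the machine, the compiler flags and hardware prefetching.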

Finally, as usual, everything depends on the details of the surrounding code. If that sequence runs repeatedly and foo() and bar() are short, the misses caused by the store may dominate the runtime. After all, buffering can't hide the cost of an unlimited number of stores; at some point you'll be bound by the intrinsic throughput of the stores.
