I'm doing micro-optimization on a performance critical part of my code and came across the sequence of instructions (in AT&T syntax):
add %rax, %rbx
mov %rdx, %rax
mov %rbx, %rdx
I thought I finally had a use case for xchg
which would allow me to shave an instruction and write:
add %rbx, %rax
xchg %rax, %rdx
However, to my dismay I found from Agner Fog's instruction tables that xchg
is a 3 micro-op instruction with a 2 cycle latency on Sandy Bridge, Ivy Bridge, Broadwell, Haswell and even Skylake. 3 whole micro-ops and 2 cycles of latency! The 3 micro-ops throws off my 4-1-1-1 cadence and the 2 cycle latency makes it worse than the original in the best case since the last 2 instructions in the original might execute in parallel.
Now... I get that the CPU might be breaking the instruction into micro-ops that are equivalent to:
mov %rax, %tmp
mov %rdx, %rax
mov %tmp, %rdx
where tmp
is an anonymous internal register and I suppose the last two micro-ops could be run in parallel so the latency is 2 cycles.
Given that register renaming occurs on these micro-architectures, though, it doesn't make sense to me that this is done this way. Why wouldn't the register renamer just swap the labels? In theory, this would have a latency of only 1 cycle (possibly 0?) and could be represented as a single micro-op so it would be much cheaper.
Supporting efficient xchg
is non-trivial, and presumably not worth the extra complexity it would require in various parts of the CPU. A real CPU's microarchitecture is much more complicated than the mental model that you can use while optimizing software for it. For example, speculative execution makes everything more complicated, because it has to be able to roll back to the point where an exception occurred.
Making fxch
efficient was important for x87 performance because the stack nature of x87 makes it (or alternatives like fld st(2)
) hard to avoid. Compiler-generated FP code (for targets without SSE support) really does use fxch
a significant amount. It seems that fast fxch
was done because it was important, not because it's easy. Intel Haswell even dropped support for single-uop fxch
. It's still zero-latency, but decodes to 2 uops on HSW and later (up from 1 in P5, and PPro through IvyBridge).
xchg
is usually easy to avoid. In most cases, you can just unroll a loop so it's ok that the same value is now in a different register. e.g. Fibonacci with add rax, rdx
/ add rdx, rax
instead of add rax, rdx
/ xchg rax, rdx
. Compilers generally don't use xchg reg,reg
, and usually hand-written asm doesn't either. (This chicken/egg problem is pretty similar to loop
being slow (Why is the loop instruction slow? Couldn't Intel have implemented it efficiently?). loop
would have been very useful for adc
loops on Core2/Nehalem where an adc
+ dec/jnz
loop causes partial-flag stalls.)
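The unrolled Fibonacci pattern above can be sketched as a toy Python model of the register dataflow (the variable names stand in for registers; this is an illustration, not asm): the same value simply alternates between the two "registers", so no swap is ever needed.

```python
def fib_pair(steps):
    """Unrolled-by-2 Fibonacci: alternate the add direction instead of
    swapping registers, mirroring add rax,rdx / add rdx,rax."""
    a, b = 0, 1              # rax, rdx
    for _ in range(steps):
        a += b               # add rax, rdx
        b += a               # add rdx, rax
    return a, b

print(fib_pair(3))           # -> (8, 13)
```

Each iteration of the Python loop corresponds to one unrolled iteration pair of the asm loop, with no xchg on the critical path.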
Since xchg
is still slow-ish on previous CPUs, compilers wouldn't start using it with -mtune=generic
for several years. Unlike fxch
or mov
-elimination, a design-change to support fast xchg
wouldn't help the CPU run most existing code faster, and would only enable performance gains over the current design in rare cases where it's actually a useful peephole optimization.
Integer registers are complicated by partial-register stuff, unlike x87
There are 4 operand sizes of xchg
, 3 of which use the same opcode with REX or operand-size prefixes. (xchg r8,r8
is a separate opcode, so it's probably easier to make the decoders decode it differently from the others). The decoders already have to recognize xchg
with a memory operand as special, because of the implicit lock
prefix, but it's probably less decoder complexity (transistor-count + power) if the reg-reg forms all decode to the same number of uops for different operand sizes.
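Those operand-size relationships come straight from the standard x86 opcode map; here is a quick sketch (Python used only as a lookup table — a real encoding also needs a ModRM byte to select the two registers, and the accumulator short forms are omitted):

```python
# Register-register xchg encodings: opcode byte plus distinguishing prefix.
encodings = {
    "xchg r8, r8":   (0x86, None),  # separate opcode for the byte form
    "xchg r16, r16": (0x87, 0x66),  # operand-size prefix
    "xchg r32, r32": (0x87, None),  # default operand size
    "xchg r64, r64": (0x87, 0x48),  # REX.W prefix
}

wider_forms = [name for name, (op, _) in encodings.items() if op == 0x87]
print(len(wider_forms))  # -> 3: the three wider sizes share one opcode
```

So the decoders can distinguish the byte form by opcode alone, while the three wider forms differ only in prefixes.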
Making some r,r
forms decode to a single uop would be even more complexity, because single-uop instructions have to be handled by the "simple" decoders as well as the complex decoder. So they would all need to be able to parse xchg
and decide whether it was a single uop or multi-uop form.
AMD and Intel CPUs behave somewhat similarly from a programmer's perspective, but there are many signs that the internal implementation is vastly different. For example, Intel mov-elimination only works some of the time, limited by some kind of microarchitectural resources, but AMD CPUs that do mov-elimination do it 100% of the time (e.g. Bulldozer for the low lane of vector regs).
See Intel's optimization manual, Example 3-25. Re-ordering Sequence to Improve Effectiveness of Zero-Latency MOV Instructions, where they discuss overwriting the zero-latency-movzx
result right away to free up the internal resource sooner. (I tried the examples on Haswell and Skylake, and found that mov-elimination did in fact work significantly more of the time when doing that, but that it was actually slightly slower in total cycles, instead of faster. The example was intended to show the benefit on IvyBridge, which probably bottlenecks on its 3 ALU ports, but HSW/SKL only bottleneck on resource conflicts in the dep chains and don't seem to be bothered by needing an ALU port for more of the movzx
instructions.)
I don't know exactly what needs tracking in a limited-size table(?) for mov-elimination. Probably it's related to needing to free register-file entries as soon as possible when they're no longer needed, because Physical Register File size limits rather than ROB size can be the bottleneck for the out-of-order window size. Swapping around indices might make this harder.
xor
-zeroing is eliminated 100% of the time on Intel Sandybridge-family; it's assumed that this works by renaming to a physical zero register, and this register never needs to be freed.
If xchg
used the same mechanism that mov-elimination does, it also could probably only work some of the time. It would need to decode to enough uops to work in cases where it isn't handled at rename. (Or else the issue/rename stage would have to insert extra uops when an xchg
will take more than 1 uop, like it does when un-laminating micro-fused uops with indexed addressing modes that can't stay micro-fused in the ROB, or when inserting merging uops for flags or high-8 partial registers. But that's a significant complication that would only be worth doing if xchg
was a common and important instruction.)
Note that xchg r32,r32
has to zero-extend both results to 64 bits, so it can't be a simple swap of RAT (Register Alias Table) entries. It would be more like truncating both registers in-place. And note that Intel CPUs never eliminate mov same,same
. It does already need to support mov r32,r32
and movzx r32, r8
with no execution port, so presumably it has some bits that indicate that rax = al
or something. (And yes, Intel HSW/SKL do that, not just Ivybridge, despite what Agner's microarch guide says.)
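A toy Python model of why the r32 form can't be a pure pointer swap (the dict is a hypothetical architectural register file, not real hardware): every 32-bit write zero-extends, so both 64-bit values change rather than merely trading places.

```python
MASK32 = (1 << 32) - 1

def xchg_r32(regs, a, b):
    """32-bit xchg semantics: each write zero-extends into the full reg."""
    old_a, old_b = regs[a], regs[b]
    regs[a] = old_b & MASK32         # truncates, doesn't just re-point
    regs[b] = old_a & MASK32

regs = {"rax": 0x1_0000_0005, "rdx": 0x2_0000_0009}
xchg_r32(regs, "rax", "rdx")
print(hex(regs["rax"]), hex(regs["rdx"]))  # 0x9 0x5 -- upper halves zeroed
```

Neither output value existed anywhere before the instruction ran, which is why a rename-only implementation doesn't work for the r32 form.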
We know P6 and SnB had upper-zeroed bits like this, because xor eax,eax
before setz al
avoids a partial-register stall when reading eax. HSW/SKL never rename al
separately in the first place, only ah
. It may not be a coincidence that partial-register renaming (other than AH) seems to have been dropped in the same uarch that introduced mov-elimination (Ivybridge). Still, setting that bit for 2 registers at once would be a special case that required special support.
xchg r64,r64
could maybe just swap the RAT entries, but decoding that differently from the r32 case is yet another complication. It might still need to trigger partial-register merging for both inputs, but add r64,r64
needs to do that, too.
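By contrast, the hypothetical r64 rename-only swap would look like this toy RAT model (names and structure invented purely for illustration): only pointers move, and no data is read or written.

```python
# Toy register alias table: architectural name -> physical register index.
rat = {"rax": 5, "rdx": 9}
prf = {5: 0x1111, 9: 0x2222}         # physical register file contents

def xchg_r64_rename(a, b):
    """Hypothetical zero-latency xchg: swap RAT entries, touch no data."""
    rat[a], rat[b] = rat[b], rat[a]

xchg_r64_rename("rax", "rdx")
print(hex(prf[rat["rax"]]), hex(prf[rat["rdx"]]))  # 0x2222 0x1111
```

The complication the answer describes is that this one "uop" updates two RAT entries at once, which real rename hardware isn't built to do.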
Also note that an Intel uop (other than fxch
) only ever produces one register result (plus flags). Not touching flags doesn't "free up" an output slot. For example, mulx r64,r64,r64
still takes 2 uops to produce 2 integer outputs on HSW/SKL, even though all the "work" is done in the multiply unit on port 1, same as with mul r64
(which does produce a flag result).
Even if it is as simple as "swap the RAT entries", building a RAT that supports writing more than one entry per uop is a complication. What to do when renaming 4 xchg
uops in a single issue group? It seems to me like it would make the logic significantly more complicated. Remember that this has to be built out of logic gates / transistors. Even if you say "handle that special case with a trap to microcode", you have to build the whole pipeline to support the possibility that that pipeline stage could take that kind of exception.
Single-uop fxch
requires support for swapping RAT entries (or some other mechanism) in the FP RAT (fRAT), but it's a separate block of hardware from the integer RAT (iRAT). Leaving out that complication in the iRAT seems reasonable even if you have it in the fRAT (pre-Haswell).
Issue/rename complexity is definitely an issue for power consumption, though. Note that Skylake widened a lot of the front-end (legacy decode and uop cache fetch), and retirement, but kept the 4-wide issue/rename limit. SKL also added replicated execution units on more ports in the back-end, so issue bandwidth is a bottleneck even more of the time, especially in code with a mix of loads, stores, and ALU.
The RAT (or the integer register file, IDK) may even have limited read ports, since there seem to be some front-end bottlenecks in issuing/renaming many 3-input uops like add rax, [rcx+rdx]
. I posted some microbenchmarks (this and the follow-up post) showing Skylake being faster than Haswell when reading lots of registers, e.g. with micro-fusion of indexed addressing modes. Or maybe the bottleneck there was really some other microarchitectural limit.
But how does 1-uop fxch
work? IDK how it's done in Sandybridge / Ivybridge. In P6-family CPUs, an extra remapping table exists basically to support FXCH
. That might only be needed because P6 uses a Retirement Register File with 1 entry per "logical" register, instead of a physical register file (PRF). As you say, you'd expect it to be simpler when even "cold" register values are just a pointer to a PRF entry. (Source: US patent 5,499,352, "Floating point register alias table FXCH and retirement floating point register array", which describes Intel's P6 uarch.)
(Thanks Andy Glew (@krazyglew), I hadn't thought of looking up patents to find out about CPU internals.) It's pretty heavy going, but may provide some insight into the bookkeeping needed for speculative execution.
Interesting tidbit: the patent describes integer as well, and mentions that there are some "hidden" logical registers which are reserved for use by microcode. (Intel's 3-uop xchg
almost certainly uses one of these as a temporary.)
We might be able to get some insight from looking at what AMD does.
Interestingly, AMD has 2-uop xchg r,r
in K10, Bulldozer-family, Bobcat/Jaguar, and Ryzen. (But Jaguar xchg r8,r8
is 3 uops. Maybe to support the xchg ah,al
corner case without a special uop for swapping the low 16 of a single reg).
Presumably both uops read the old values of the input architectural registers before the first one updates the RAT. IDK exactly how this works, since they aren't necessarily issued/renamed in the same cycle (but they are at least contiguous in the uop flow, so at worst the 2nd uop is the first uop in the next cycle). I have no idea if Haswell's 2-uop fxch
works similarly, or if they're doing something else.
Ryzen is a new architecture designed after mov-elimination was "invented", so presumably they take advantage of it wherever possible. (Bulldozer-family renames vector moves (but only for the low 128b lane of YMM vectors); Ryzen is the first AMD architecture to do it for GP regs too.) xchg r32,r32
and r64,r64
are zero-latency (renamed), but still 2 uops each. (r8
and r16
need an execution unit, because they merge with the old value instead of zero-extending or copying the entire reg, but are still only 2 uops).
Ryzen's fxch
is 1 uop. AMD (like Intel) probably isn't spending a lot of transistors on making x87 fast (e.g. fmul
is only 1 per clock and on the same port as fadd
), so presumably they were able to do this without a lot of extra support. Their micro-coded x87 instructions (like fyl2x
) are faster than on recent Intel CPUs, so maybe Intel cares even less (at least about the microcoded x87 instructions).
Maybe AMD could have made xchg r64,r64
a single uop too, more easily than Intel. Maybe even xchg r32,r32
could be single uop, since like Intel it needs to support mov r32,r32
zero-extension with no execution port, so maybe it could just set whatever "upper 32 zeroed" bit exists to support that. Ryzen doesn't eliminate movzx r32, r8
at rename, so presumably there's only an upper32-zero bit, not bits for other widths.
What Intel might be able to do cheaply if they wanted to:
It's possible that Intel could support 2-uop xchg r,r
the way Ryzen does (zero latency for the r32,r32
and r64,r64
forms, or 1c for the r8,r8
and r16,r16
forms) without too much extra complexity in critical parts of the core, like the issue/rename and retirement stages that manage the Register Alias Table (RAT). But maybe not, if they can't have 2 uops read the "old" value of a register when the first uop writes it.
Stuff like xchg ah,al
is definitely an extra complication, since Intel CPUs don't rename partial registers separately anymore, except AH/BH/CH/DH.
xchg
latency in practice on current hardware
Your guess about how it might work internally is good. It almost certainly uses one of the internal temporary registers (accessible only to microcode). Your guess about how they can reorder is too limited, though. In fact, one direction has 2c latency and the other direction has ~1c latency.
00000000004000e0 <_start.loop>:
4000e0: 48 87 d1 xchg rcx,rdx # slow version
4000e3: 48 83 c1 01 add rcx,0x1
4000e7: 48 83 c1 01 add rcx,0x1
4000eb: 48 87 ca xchg rdx,rcx
4000ee: 48 83 c2 01 add rdx,0x1
4000f2: 48 83 c2 01 add rdx,0x1
4000f6: ff cd dec ebp
4000f8: 7f e6 jg 4000e0 <_start.loop>
This loop runs in ~8.06 cycles per iteration on Skylake. Reversing the xchg
operands makes it run in ~6.23 cycles per iteration (measured with perf stat
on Linux). uops issued/executed counters are equal, so no elimination happened. It looks like the dst <- src
direction is the slow one, since putting the add
uops on that dependency chain makes things slower than when they're on the dst -> src
dependency chain.
If you ever want to use xchg reg,reg
on the critical path (code-size reasons?), do it with the dst -> src
direction on the critical path, because that's only about 1c latency.
Other side-topics from comments and the question
Sandybridge-family decoders are different from Core2/Nehalem. They can produce up to 4 uops total, not 7, so the patterns are 1-1-1-1
, 2-1-1
, 3-1
, or 4
.
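A small Python sketch of that grouping rule (my reading of the constraint, not a cycle-accurate decoder model): only the first slot of a decode group may take a multi-uop instruction, and a group carries at most 4 uops across at most 4 instructions.

```python
def decode_groups(uop_counts):
    """Split a stream of per-instruction uop counts into
    Sandybridge-style decode groups: 1-1-1-1, 2-1-1, 3-1, or 4."""
    groups, cur, total = [], [], 0
    for n in uop_counts:
        # Start a new group if this instruction is multi-uop (only the
        # first/complex decoder handles those), or the group is full.
        if cur and (n > 1 or total + n > 4 or len(cur) == 4):
            groups.append(cur)
            cur, total = [], 0
        cur.append(n)
        total += n
    if cur:
        groups.append(cur)
    return groups

print(decode_groups([1, 1, 1, 1, 2, 1, 1, 3, 1, 4]))
# -> [[1, 1, 1, 1], [2, 1, 1], [3, 1], [4]]
```

The example stream reproduces each of the four patterns named above.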
Also beware that if the last uop is one that can macro-fuse, they will hang onto it until the next decode cycle in case the first instruction in the next block is a jcc
. (This is a win when code runs multiple times from the uop cache for each time it's decoded. And that's still usually 3 uops per clock decode throughput.)
Skylake has an extra "simple" decoder so it can do 1-1-1-1-1
up to 4-1
I guess, but > 4 uops for one instruction still requires the microcode ROM. Skylake beefed up the uop cache, too, and can often bottleneck on the 4 fused-domain uops per clock issue/rename throughput limit if the back-end (or branch misses) aren't a bottleneck first.
That seems kinda crazy, unless you're mostly limiting yourself to asm-level optimization in shorter loops inside your main loop. Any inner loops within the main loop will still run from the uop cache, and that should probably be where you're spending most of your time optimizing. Compilers usually do a good-enough job that it's not practical for a human to do much over a large scale. Try to write your C or C++ in such a way that the compiler can do a good job with it, of course, but looking for tiny peephole optimizations like this over 18kB of code seems like going down the rabbit hole.
Use perf counters like idq.dsb_uops
vs. uops_issued.any
to see how many of your total uops came from the uop cache (DSB = Decode Stream Buffer or something). Intel's optimization manual has some suggestions for other perf counters to look at for code that doesn't fit in the uop cache, such as DSB2MITE_SWITCHES.PENALTY_CYCLES
. (MITE is the legacy-decode path). Search the pdf for DSB to find a few places it's mentioned.
Perf counters will help you find spots with potential problems, e.g. regions with higher than average uops_issued.stall_cycles
could benefit from finding ways to expose more ILP if there are any, or from solving a front-end problem, or from reducing branch-mispredicts.
As discussed in comments, a single uop produces at most 1 register result
Terminology: the multiply result doesn't go into the ROB. It goes over the forwarding network to whatever other uops read it, and goes into the PRF.
The mul %rbx
instruction decodes to 2 uops in the decoders. They don't even have to issue in the same cycle, let alone execute in the same cycle.
However, Agner Fog's instruction tables only list a single latency number. It turns out that 3 cycles is the latency from both inputs to RAX. The minimum latency for RDX is 4c, according to InstlatX64 testing on both Haswell and Skylake-X.
From this, I conclude that the 2nd uop is dependent on the first, and exists to write the high half of the result to an architectural register. The port1 uop produces a full 128b multiply result.
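The two architectural results can be illustrated with Python's arbitrary-precision integers (this shows only the ISA semantics of one-operand mul, nothing microarchitectural): the full 128-bit product splits into a low half for RAX and a high half for RDX.

```python
def mul_r64(rax, src):
    """ISA semantics of one-operand 64-bit mul: RDX:RAX = RAX * src."""
    product = rax * src                  # full 128-bit result
    low = product & ((1 << 64) - 1)      # written to RAX
    high = product >> 64                 # written to RDX
    return low, high

print([hex(v) for v in mul_r64(1 << 63, 4)])  # -> ['0x0', '0x2']
```

With the measured latencies, it's the `high` half that arrives a cycle later.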
I don't know where the high-half result lives until the p6 uop reads it. Perhaps there's some sort of internal queue between the multiply execution unit and hardware connected to port 6. By scheduling the p6 uop with a dependency on the low-half result, that might arrange for the p6 uops from multiple in-flight mul
instructions to run in the correct order. But then instead of actually using that dummy low-half input, the uop would take the high-half result from the queue output in an execution unit that's connected to port 6 and return that as the result. (This is pure guesswork, but I think it's plausible as one possible internal implementation. See comments for some earlier ideas).
Interestingly, according to Agner Fog's instruction tables, on Haswell the two uops for mul r64
go to ports 1 and 6. mul r32
is 3 uops, and runs on p1 + p0156. Agner doesn't say whether that's really 2p1 + p0156
or p1 + 2p0156
like he does for some other insns. (However, he says that mulx r32,r32,r32
runs on p1 + 2p056
(note that p056 doesn't include p1).)
Even more strangely, he says that Skylake runs mulx r64,r64,r64
on p1 p5
but mul r64
on p1 p6
. If that's accurate and not a typo (which is a possibility), it pretty much rules out the possibility that the extra uop is an upper-half multiplier.