
Question


I'm doing micro-optimization on a performance critical part of my code and came across the sequence of instructions (in AT&T syntax):

add %rax, %rbx
mov %rdx, %rax
mov %rbx, %rdx

I thought I finally had a use case for xchg which would allow me to shave an instruction and write:

add  %rbx, %rax
xchg %rax, %rdx

However, to my dismay I found from Agner Fog's instruction tables that xchg is a 3 micro-op instruction with 2-cycle latency on Sandy Bridge, Ivy Bridge, Broadwell, Haswell, and even Skylake. 3 whole micro-ops and 2 cycles of latency! The 3 micro-ops throw off my 4-1-1-1 cadence, and the 2-cycle latency makes it worse than the original in the best case, since the last 2 instructions in the original might execute in parallel.

Now... I get that the CPU might be breaking the instruction into micro-ops that are equivalent to:

mov %rax, %tmp
mov %rdx, %rax
mov %tmp, %rdx 

where tmp is an anonymous internal register and I suppose the last two micro-ops could be run in parallel so the latency is 2 cycles.

Given that register renaming occurs on these micro-architectures, though, it doesn't make sense to me that this is done this way. Why wouldn't the register renamer just swap the labels? In theory, this would have a latency of only 1 cycle (possibly 0?) and could be represented as a single micro-op so it would be much cheaper.

Solution

Supporting efficient xchg is non-trivial, and presumably not worth the extra complexity it would require in various parts of the CPU. A real CPU's microarchitecture is much more complicated than the mental model that you can use while optimizing software for it. For example, speculative execution makes everything more complicated, because it has to be able to roll back to the point where an exception occurred.

Making fxch efficient was important for x87 performance because the stack nature of x87 makes it (or alternatives like fld st(2)) hard to avoid. Compiler-generated FP code (for targets without SSE support) really does use fxch a significant amount. It seems that fast fxch was done because it was important, not because it's easy. Intel Haswell even dropped support for single-uop fxch. It's still zero-latency, but decodes to 2 uops on HSW and later (up from 1 in P5, and PPro through IvyBridge).

xchg is usually easy to avoid. In most cases, you can just unroll a loop so it's ok that the same value is now in a different register. e.g. Fibonacci with add rax, rdx / add rdx, rax instead of add rax, rdx / xchg rax, rdx. Compilers generally don't use xchg reg,reg, and usually hand-written asm doesn't either. (This chicken/egg problem is pretty similar to loop being slow (Why is the loop instruction slow? Couldn't Intel have implemented it efficiently?). loop would have been very useful for adc loops on Core2/Nehalem, where an adc + dec/jnz loop causes partial-flag stalls.)
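The two-accumulator unroll can be sketched in C (a hypothetical `fib` helper; the loop body is exactly the alternating-add pattern described above, with no swap anywhere):

```c
#include <stdint.h>

// Unrolled 2x: a and b alternate roles each half-iteration, so no
// swap/xchg is ever needed to keep "previous" and "current" straight.
uint64_t fib(unsigned n) {
    uint64_t a = 0, b = 1;          // fib(0), fib(1)
    unsigned i;
    for (i = 0; i + 2 <= n; i += 2) {
        a += b;                     // a becomes fib(i+2)
        b += a;                     // b becomes fib(i+3)
    }
    return (i == n) ? a : b;
}
```

A compiler given this source is free to emit the add rax, rdx / add rdx, rax sequence directly, since the roles of the two registers swap in the source rather than at run time.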

Since xchg is still slow-ish on previous CPUs, compilers wouldn't start using it with -mtune=generic for several years. Unlike fxch or mov-elimination, a design-change to support fast xchg wouldn't help the CPU run most existing code faster, and would only enable performance gains over the current design in rare cases where it's actually a useful peephole optimization.


Integer registers are complicated by partial-register stuff, unlike x87

There are 4 operand sizes of xchg, 3 of which use the same opcode with REX or operand-size prefixes. (xchg r8,r8 is a separate opcode, so it's probably easier to make the decoders decode it differently from the others). The decoders already have to recognize xchg with a memory operand as special, because of the implicit lock prefix, but it's probably less decoder complexity (transistor-count + power) if the reg-reg forms all decode to the same number of uops for different operand sizes.

Making some r,r forms decode to a single uop would be even more complexity, because single-uop instructions have to be handled by the "simple" decoders as well as the complex decoder. So they would all need to be able to parse xchg and decide whether it was a single uop or multi-uop form.


AMD and Intel CPUs behave somewhat similarly from a programmer's perspective, but there are many signs that the internal implementation is vastly different. For example, Intel mov-elimination only works some of the time, limited by some kind of microarchitectural resources, but AMD CPUs that do mov-elimination do it 100% of the time (e.g. Bulldozer for the low lane of vector regs).

See Intel's optimization manual, Example 3-25. Re-ordering Sequence to Improve Effectiveness of Zero-Latency MOV Instructions, where they discuss overwriting the zero-latency-movzx result right away to free up the internal resource sooner. (I tried the examples on Haswell and Skylake, and found that mov-elimination did in fact work significantly more of the time when doing that, but that it was actually slightly slower in total cycles, instead of faster. The example was intended to show the benefit on IvyBridge, which probably bottlenecks on its 3 ALU ports, but HSW/SKL only bottleneck on resource conflicts in the dep chains and don't seem to be bothered by needing an ALU port for more of the movzx instructions.)

I don't know exactly what needs tracking in a limited-size table(?) for mov-elimination. Probably it's related to needing to free register-file entries as soon as possible when they're no longer needed, because Physical Register File size limits rather than ROB size can be the bottleneck for the out-of-order window size. Swapping around indices might make this harder.

xor-zeroing is eliminated 100% of the time on Intel Sandybridge-family; it's assumed that this works by renaming to a physical zero register, and this register never needs to be freed.
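As a rough illustration only (the real RAT is nothing this simple, and `rename_write` / `rename_xor_zero` / `PHYS_ZERO` are invented names for this sketch), renaming to a shared, never-freed zero entry avoids allocating anything:

```c
// Toy RAT sketch: a normal destination write allocates a fresh physical
// register, but xor-zeroing just points the architectural register at a
// shared physical-zero entry that never needs to be freed.
#define PHYS_ZERO 0
enum { RAX, RCX, RDX, NREGS };

static int rat[NREGS];      // architectural -> physical mapping (all start at zero)
static int next_phys = 1;   // entry 0 is reserved as the zero register

void rename_write(int reg)    { rat[reg] = next_phys++; } // allocates an entry
void rename_xor_zero(int reg) { rat[reg] = PHYS_ZERO;   } // allocates nothing
```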

If xchg used the same mechanism that mov-elimination does, it also could probably only work some of the time. It would need to decode to enough uops to work in cases where it isn't handled at rename. (Or else the issue/rename stage would have to insert extra uops when an xchg will take more than 1 uop, like it does when un-laminating micro-fused uops with indexed addressing modes that can't stay micro-fused in the ROB, or when inserting merging uops for flags or high-8 partial registers. But that's a significant complication that would only be worth doing if xchg was a common and important instruction.)

Note that xchg r32,r32 has to zero-extend both results to 64 bits, so it can't be a simple swap of RAT (Register Alias Table) entries. It would be more like truncating both registers in-place. And note that Intel CPUs never eliminate mov same,same. It does already need to support mov r32,r32 and movzx r32, r8 with no execution port, so presumably it has some bits that indicate that rax = al or something. (And yes, Intel HSW/SKL do that, not just Ivybridge, despite what Agner's microarch guide says.)
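For reference, the architectural effect being described (both 32-bit results zero-extended into the full registers, like any 32-bit write) can be sketched as:

```c
#include <stdint.h>

// Architectural effect of `xchg eax, edx` on the full 64-bit registers:
// both destinations are zero-extended, so the old upper 32 bits of each
// register are destroyed, not swapped.
void xchg_r32(uint64_t *rax, uint64_t *rdx) {
    uint32_t lo_a = (uint32_t)*rax;
    uint32_t lo_d = (uint32_t)*rdx;
    *rax = (uint64_t)lo_d;   // upper 32 bits forced to zero
    *rdx = (uint64_t)lo_a;   // upper 32 bits forced to zero
}
```

This is why it can't be a pure swap of RAT entries: a pure swap would preserve the upper halves.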

We know P6 and SnB had upper-zeroed bits like this, because xor eax,eax before setz al avoids a partial-register stall when reading eax. HSW/SKL never rename al separately in the first place, only ah. It may not be a coincidence that partial-register renaming (other than AH) seems to have been dropped in the same uarch that introduced mov-elimination (Ivybridge). Still, setting that bit for 2 registers at once would be a special case that required special support.

xchg r64,r64 could maybe just swap the RAT entries, but decoding that differently from the r32 case is yet another complication. It might still need to trigger partial-register merging for both inputs, but add r64,r64 needs to do that, too.

Also note that an Intel uop (other than fxch) only ever produces one register result (plus flags). Not touching flags doesn't "free up" an output slot; for example, mulx r64,r64,r64 still takes 2 uops to produce 2 integer outputs on HSW/SKL, even though all the "work" is done in the multiply unit on port 1, same as with mul r64 which does produce a flag result.

Even if it is as simple as "swap the RAT entries", building a RAT that supports writing more than one entry per uop is a complication. What to do when renaming 4 xchg uops in a single issue group? It seems to me like it would make the logic significantly more complicated. Remember that this has to be built out of logic gates / transistors. Even if you say "handle that special case with a trap to microcode", you have to build the whole pipeline to support the possibility that that pipeline stage could take that kind of exception.

Single-uop fxch requires support for swapping RAT entries (or some other mechanism) in the FP RAT (fRAT), but it's a separate block of hardware from the integer RAT (iRAT). Leaving out that complication in the iRAT seems reasonable even if you have it in the fRAT (pre-Haswell).

Issue/rename complexity is definitely an issue for power consumption, though. Note that Skylake widened a lot of the front-end (legacy decode and uop cache fetch), and retirement, but kept the 4-wide issue/rename limit. SKL also added replicated execution units on more ports in the back-end, so issue bandwidth is a bottleneck even more of the time, especially in code with a mix of loads, stores, and ALU.

The RAT (or the integer register file, IDK) may even have limited read ports, since there seem to be some front-end bottlenecks in issuing/renaming many 3-input uops like add rax, [rcx+rdx]. I posted some microbenchmarks (this and the follow-up post) showing Skylake being faster than Haswell when reading lots of registers, e.g. with micro-fusion of indexed addressing modes. Or maybe the bottleneck there was really some other microarchitectural limit.


But how does 1-uop fxch work? IDK how it's done in Sandybridge / Ivybridge. In P6-family CPUs, an extra remapping table exists basically to support FXCH. That might only be needed because P6 uses a Retirement Register File with 1 entry per "logical" register, instead of a physical register file (PRF). As you say, you'd expect it to be simpler when even "cold" register values are just a pointer to a PRF entry. (Source: US patent 5,499,352: Floating point register alias table FXCH and retirement floating point register array (describes Intel's P6 uarch).)

(Thanks Andy Glew (@krazyglew), I hadn't thought of looking up patents to find out about CPU internals.) It's pretty heavy going, but may provide some insight into the bookkeeping needed for speculative execution.

Interesting tidbit: the patent describes integer as well, and mentions that there are some "hidden" logical registers which are reserved for use by microcode. (Intel's 3-uop xchg almost certainly uses one of these as a temporary.)


We might be able to get some insight from looking at what AMD does.

Interestingly, AMD has 2-uop xchg r,r in K10, Bulldozer-family, Bobcat/Jaguar, and Ryzen. (But Jaguar xchg r8,r8 is 3 uops. Maybe to support the xchg ah,al corner case without a special uop for swapping the low 16 of a single reg).

Presumably both uops read the old values of the input architectural registers before the first one updates the RAT. IDK exactly how this works, since they aren't necessarily issued/renamed in the same cycle (but they are at least contiguous in the uop flow, so at worst the 2nd uop is the first uop in the next cycle). I have no idea if Haswell's 2-uop fxch works similarly, or if they're doing something else.

Ryzen is a new architecture designed after mov-elimination was "invented", so presumably they take advantage of it wherever possible. (Bulldozer-family renames vector moves (but only for the low 128b lane of YMM vectors); Ryzen is the first AMD architecture to do it for GP regs too.) xchg r32,r32 and r64,r64 are zero-latency (renamed), but still 2 uops each. (r8 and r16 need an execution unit, because they merge with the old value instead of zero-extending or copying the entire reg, but are still only 2 uops).
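The merge-vs-zero-extend distinction that makes the narrow forms need an execution unit can be sketched as follows (`write_r16` / `write_r32` are illustrative names for the two write behaviors, not real instructions):

```c
#include <stdint.h>

// Why an r16 write needs ALU work: the result merges into the old 64-bit
// value, so the old value is an input. A 32-bit write just replaces it.
uint64_t write_r16(uint64_t old64, uint16_t new16) {
    return (old64 & ~0xFFFFULL) | new16;   // merge: upper 48 bits kept
}

uint64_t write_r32(uint64_t old64, uint32_t new32) {
    (void)old64;                           // old value is not an input
    return (uint64_t)new32;                // zero-extend: whole reg replaced
}
```

Because the r32/r64 forms produce the whole register from the source alone, they can be handled purely by renaming; the r8/r16 forms have a data dependency on the destination's old value.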

Ryzen's fxch is 1 uop. AMD (like Intel) probably isn't spending a lot of transistors on making x87 fast (e.g. fmul is only 1 per clock and on the same port as fadd), so presumably they were able to do this without a lot of extra support. Their micro-coded x87 instructions (like fyl2x) are faster than on recent Intel CPUs, so maybe Intel cares even less (at least about the microcoded x87 instructions).

Maybe AMD could have made xchg r64,r64 a single uop too, more easily than Intel. Maybe even xchg r32,r32 could be single uop, since like Intel it needs to support mov r32,r32 zero-extension with no execution port, so maybe it could just set whatever "upper 32 zeroed" bit exists to support that. Ryzen doesn't eliminate movzx r32, r8 at rename, so presumably there's only an upper32-zero bit, not bits for other widths.


What Intel might be able to do cheaply if they wanted to:

It's possible that Intel could support 2-uop xchg r,r the way Ryzen does (zero latency for the r32,r32 and r64,r64 forms, or 1c for the r8,r8 and r16,r16 forms) without too much extra complexity in critical parts of the core, like the issue/rename and retirement stages that manage the Register Alias Table (RAT). But maybe not, if they can't have 2 uops read the "old" value of a register when the first uop writes it.

Stuff like xchg ah,al is definitely an extra complication, since Intel CPUs don't rename partial registers separately anymore, except AH/BH/CH/DH.


xchg latency in practice on current hardware

Your guess about how it might work internally is good. It almost certainly uses one of the internal temporary registers (accessible only to microcode). Your guess about how they can reorder is too limited, though. In fact, one direction has 2c latency and the other direction has ~1c latency.

00000000004000e0 <_start.loop>:
  4000e0:       48 87 d1                xchg   rcx,rdx   # slow version
  4000e3:       48 83 c1 01             add    rcx,0x1
  4000e7:       48 83 c1 01             add    rcx,0x1
  4000eb:       48 87 ca                xchg   rdx,rcx
  4000ee:       48 83 c2 01             add    rdx,0x1
  4000f2:       48 83 c2 01             add    rdx,0x1
  4000f6:       ff cd                   dec    ebp
  4000f8:       7f e6                   jg     4000e0 <_start.loop>

This loop runs in ~8.06 cycles per iteration on Skylake. Reversing the xchg operands makes it run in ~6.23 cycles per iteration (measured with perf stat on Linux). uops issued/executed counters are equal, so no elimination happened. It looks like the dst <- src direction is the slow one, since putting the add uops on that dependency chain makes things slower than when they're on the dst -> src dependency chain.

If you ever want to use xchg reg,reg on the critical path (code-size reasons?), do it with the dst -> src direction on the critical path, because that's only about 1c latency.
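A toy latency model (my assumption about the chain structure, not a measurement) reproduces those numbers: the loop-carried chain crosses an xchg twice per iteration, with two 1c adds after each crossing.

```c
// Toy model of the loop above: per iteration the critical path is
//   reg ->(xchg) other reg -> add(+1) -> add(+1) ->(xchg) reg -> add(+1) -> add(+1)
// so the xchg's chain-direction latency is paid twice, plus 4 add cycles.
// Latencies are the ones inferred in the text; this is a sketch, not hardware.
int cycles_per_iter(int xchg_chain_latency) {
    return 2 * (xchg_chain_latency + 2);
}
```

With the slow (dst <- src) direction on the chain, `cycles_per_iter(2)` gives 8, matching the ~8.06 measurement; with the ~1c direction it gives 6, close to the measured ~6.23.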


Other side-topics from comments and the question

Sandybridge-family decoders are different from Core2/Nehalem. They can produce up to 4 uops total, not 7, so the patterns are 1-1-1-1, 2-1-1, 3-1, or 4.
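A toy grouping model of those patterns (ignoring macro-fusion, the hold-back behavior described below, and >4-uop microcoded instructions; `decode_cycles` is an invented helper):

```c
// Toy model of SnB-family decode grouping: up to 4 instructions per cycle,
// only the first (complex) decoder may emit more than 1 uop, and the group
// is cut when it would exceed 4 uops total. Returns decode cycles needed
// for a stream of instructions given each instruction's uop count.
int decode_cycles(const int *uops, int n) {
    int cycles = 0, i = 0;
    while (i < n) {
        int total = uops[i++];   // complex decoder takes the first insn
        int slots = 1;
        // simple decoders: 1 uop each, up to 4 instructions / 4 uops total
        while (i < n && slots < 4 && uops[i] == 1 && total + 1 <= 4) {
            total += 1;
            slots += 1;
            i++;
        }
        cycles++;
    }
    return cycles;
}
```

Under this model a stream of four 1-uop instructions decodes in one cycle (the 1-1-1-1 pattern), while a 2-uop instruction followed by three 1-uop instructions takes two cycles (2-1-1, then 1).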

Also beware that if the last uop is one that can macro-fuse, they will hang onto it until the next decode cycle in case the first instruction in the next block is a jcc. (This is a win when code runs multiple times from the uop cache for each time it's decoded. And that's still usually 3 uops per clock decode throughput.)

Skylake has an extra "simple" decoder so it can do patterns from 1-1-1-1-1 up to 4-1, I guess, but > 4 uops for one instruction still requires the microcode ROM. Skylake beefed up the uop cache, too, and can often bottleneck on the 4 fused-domain uops per clock issue/rename throughput limit if the back-end (or branch misses) aren't a bottleneck first.

That seems kinda crazy, unless you're mostly limiting yourself to asm-level optimization in shorter loops inside your main loop. Any inner loops within the main loop will still run from the uop cache, and that should probably be where you're spending most of your time optimizing. Compilers usually do a good-enough job that it's not practical for a human to do much over a large scale. Try to write your C or C++ in such a way that the compiler can do a good job with it, of course, but looking for tiny peephole optimizations like this over 18kB of code seems like going down the rabbit hole.

Use perf counters like idq.dsb_uops vs. uops_issued.any to see how many of your total uops came from the uop cache (DSB = Decode Stream Buffer or something). Intel's optimization manual has some suggestions for other perf counters to look at for code that doesn't fit in the uop cache, such as DSB2MITE_SWITCHES.PENALTY_CYCLES. (MITE is the legacy-decode path). Search the pdf for DSB to find a few places it's mentioned.

Perf counters will help you find spots with potential problems, e.g. regions with higher than average uops_issued.stall_cycles could benefit from finding ways to expose more ILP if there are any, or from solving a front-end problem, or from reducing branch-mispredicts.


As discussed in comments, a single uop produces at most 1 register result

Terminology: the multiply result doesn't go into the ROB. It goes over the forwarding network to whatever other uops read it, and goes into the PRF.

The mul %rbx instruction decodes to 2 uops in the decoders. They don't even have to issue in the same cycle, let alone execute in the same cycle.

However, Agner Fog's instruction tables only list a single latency number. It turns out that 3 cycles is the latency from both inputs to RAX. The minimum latency for RDX is 4c, according to InstlatX64 testing on both Haswell and Skylake-X.

From this, I conclude that the 2nd uop is dependent on the first, and exists to write the high half of the result to an architectural register. The port1 uop produces a full 128b multiply result.
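Architecturally, the two outputs are just the halves of one 128-bit product; a C sketch of the semantics (not the hardware):

```c
#include <stdint.h>

// What `mul %rbx` computes architecturally: RDX:RAX = RAX * RBX.
// The port-1 uop produces the full 128-bit product; per the reasoning
// above, the second uop writes the high half to the architectural RDX.
void mul_r64(uint64_t rax_in, uint64_t rbx,
             uint64_t *rax_out, uint64_t *rdx_out) {
    unsigned __int128 product = (unsigned __int128)rax_in * rbx;
    *rax_out = (uint64_t)product;          // low half  -> RAX (3c latency)
    *rdx_out = (uint64_t)(product >> 64);  // high half -> RDX (4c latency)
}
```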

I don't know where the high-half result lives until the p6 uop reads it. Perhaps there's some sort of internal queue between the multiply execution unit and hardware connected to port 6. By scheduling the p6 uop with a dependency on the low-half result, that might arrange for the p6 uops from multiple in-flight mul instructions to run in the correct order. But then instead of actually using that dummy low-half input, the uop would take the high-half result from the queue output in an execution unit that's connected to port 6 and return that as the result. (This is pure guesswork, but I think it's plausible as one possible internal implementation. See comments for some earlier ideas.)

Interestingly, according to Agner Fog's instruction tables, on Haswell the two uops for mul r64 go to ports 1 and 6. mul r32 is 3 uops, and runs on p1 + p0156. Agner doesn't say whether that's really 2p1 + p0156 or p1 + 2p0156 like he does for some other insns. (However, he says that mulx r32,r32,r32 runs on p1 + 2p056 (note that p056 doesn't include p1).)

Even more strangely, he says that Skylake runs mulx r64,r64,r64 on p1 p5 but mul r64 on p1 p6. If that's accurate and not a typo (which is a possibility), it pretty much rules out the possibility that the extra uop is an upper-half multiplier.
