Why do compilers insist on using a callee-saved register here?

Consider this C code:

```c
void foo(void);

long bar(long x) {
    foo();
    return x;
}
```

When I compile it on GCC 9.3 with either -O3 or -Os, I get this:

```asm
bar:
        push    r12
        mov     r12, rdi
        call    foo
        mov     rax, r12
        pop     r12
        ret
```

The output from clang is identical, except for choosing rbx instead of r12 as the callee-saved register.

However, I want/expect to see assembly that looks more like this:

```asm
bar:
        push    rdi
        call    foo
        pop     rax
        ret
```

In English, here's what I see happening:

1. Push the old value of a callee-saved register to the stack
2. Move x into that callee-saved register
3. Call foo
4. Move x from the callee-saved register into the return-value register
5. Pop the stack to restore the old value of the callee-saved register

Why bother to mess with a callee-saved register at all? Why not do this instead? It seems shorter, simpler, and probably faster:

1. Push x to the stack
2. Call foo
3. Pop x from the stack into the return-value register

Is my assembly wrong? Is it somehow less efficient than messing with an extra register?
If the answer to both of those is "no", then why doesn't either GCC or clang do it this way? Godbolt link.

Edit: Here's a less trivial example, to show it happens even if the variable is meaningfully used:

```c
long foo(long);

long bar(long x) {
    return foo(x * x) - x;
}
```

I get this:

```asm
bar:
        push    rbx
        mov     rbx, rdi
        imul    rdi, rdi
        call    foo
        sub     rax, rbx
        pop     rbx
        ret
```

I'd rather have this:

```asm
bar:
        push    rdi
        imul    rdi, rdi
        call    foo
        pop     rdi
        sub     rax, rdi
        ret
```

This time, it's only one instruction off vs. two, but the core concept is the same. Godbolt link.

Solution

TL:DR:

- Compiler internals are probably not set up to look for this optimization easily, and it's probably only useful around small functions, not inside large functions between calls.
- Inlining to create large functions is a better solution most of the time.
- There can be a latency vs. throughput tradeoff if foo happens not to save/restore RBX.

Compilers are complex pieces of machinery. They're not "smart" like a human, and expensive algorithms to find every possible optimization are often not worth the cost in extra compile time.

I reported this as GCC bug 69986 - smaller code possible with -Os by using push/pop to spill/reload - back in 2016; there's been no activity or replies from GCC devs. :/

Slightly related: GCC bug 70408 - reusing the same call-preserved register would give smaller code in some cases. Compiler devs told me it would take a huge amount of work for GCC to be able to do that optimization, because it requires picking the order of evaluation of two foo(int) calls based on what would make the target asm simpler.

If foo doesn't save/restore rbx itself, there's a tradeoff between throughput (instruction count) vs. an extra store/reload latency on the x -> retval dependency chain.

Compilers usually favour latency over throughput, e.g. using 2x LEA instead of imul reg, reg, 10 (3-cycle latency, 1/clock throughput), because most code averages significantly less than 4 uops/clock on typical 4-wide pipelines like Skylake.
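That 2x-LEA decomposition can be sketched in C (the function name is mine, purely for illustration): x*10 becomes x*5 via one LEA-shaped computation, then a doubling, trading one 3-cycle imul for two 1-cycle operations.

```c
#include <assert.h>

/* Sketch (hypothetical name) of the two-LEA sequence a compiler might emit
 * for x*10 instead of imul reg, reg, 10:
 *   lea rax, [rdi + rdi*4]   ; x*5, 1-cycle latency
 *   add rax, rax             ; *2,  1-cycle latency
 * Two cheap uops (worse throughput cost) instead of one 3-cycle imul
 * (worse latency): this is the latency-over-throughput choice. */
long mul10_two_lea(long x) {
    long x5 = x + x * 4;  /* first LEA: base + index*4 */
    return x5 + x5;       /* second LEA (or add): double it */
}
```

The arithmetic is exactly equivalent to x*10 for all inputs, including negatives.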
(More instructions/uops do take more space in the ROB, reducing how far ahead the same out-of-order window can see, though, and execution is actually bursty, with stalls probably accounting for some of the less-than-4-uops/clock average.)

If foo does push/pop RBX, then there's not much to gain for latency. Having the restore happen just before the ret instead of just after it is probably not relevant, unless there's a ret mispredict or I-cache miss that delays fetching code at the return address.

Most non-trivial functions will save/restore RBX, so it's often not a good assumption that leaving a variable in RBX will actually mean it truly stayed in a register across the call. (Although randomizing which call-preserved registers functions choose might be a good idea to mitigate this sometimes.)

So yes, push rdi / pop rax would be more efficient in this case, and this is probably a missed optimization for tiny non-leaf functions, depending on what foo does and the balance between extra store/reload latency for x vs. more instructions to save/restore the caller's rbx.

It is possible for stack-unwind metadata to represent the changes to RSP here, just like if it had used sub rsp, 8 to spill/reload x into a stack slot. (But compilers don't know this optimization either: using push to reserve space and initialize a variable; see What C/C++ compiler can use push pop instructions for creating local variables, instead of just increasing esp once? And doing that for more than one local var would lead to larger .eh_frame stack-unwind metadata, because you're moving the stack pointer separately with each push. That doesn't stop compilers from using push/pop to save/restore call-preserved regs, though.)

IDK if it would be worth teaching compilers to look for this optimization

It's maybe a good idea around a whole function, not across one call inside a function. And as I said, it's based on the pessimistic assumption that foo will save/restore RBX anyway.
(Or optimizing for throughput, if you know that latency from x to the return value isn't important. But compilers don't know that, and usually optimize for latency.)

If you start making that pessimistic assumption in lots of code (like around single function calls inside functions), you'll start getting more cases where RBX isn't saved/restored and you could have taken advantage.

You also don't want this extra save/restore push/pop in a loop; just save/restore RBX outside the loop and use call-preserved registers in loops that make function calls. Even without loops, most functions make multiple function calls in the general case. This optimization idea could apply if you really don't use x between any of the calls, just before the first and after the last; otherwise you have the problem of maintaining 16-byte stack alignment for each call, e.g. if you do one pop after one call and before another.

Compilers are not great at tiny functions in general. But it's not great for CPUs either. Non-inline function calls have an impact on optimization at the best of times, unless compilers can see the internals of the callee and make more assumptions than usual. A non-inline function call is an implicit memory barrier: the caller has to assume that a function might read or write any globally-accessible data, so all such vars have to be in sync with the C abstract machine. (Escape analysis allows keeping locals in registers across calls if their address hasn't escaped the function.) Also, the compiler has to assume that the call-clobbered registers are all clobbered. This sucks for floating point in x86-64 System V, which has no call-preserved XMM registers.

Tiny functions like bar() are better off inlined into their callers. Compile with -flto so this can happen even across file boundaries in most cases.
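The 16-byte alignment problem mentioned above can be shown with a toy model (all names are mine, not a real API). In x86-64 System V, RSP must be 16-byte aligned at every call instruction, and a function is entered with rsp % 16 == 8 because the call just pushed an 8-byte return address. So one push makes the first call aligned, but popping right after that call leaves a second call misaligned:

```c
#include <assert.h>

/* Toy model (hypothetical names) tracking rsp % 16 under the SysV rule.
 * Note: adding and subtracting 8 are the same thing mod 16, so push8 and
 * pop8 have identical effect on the residue; both flip it between 8 and 0. */
typedef struct { unsigned rsp_mod16; } StackModel;

static void push8(StackModel *s) { s->rsp_mod16 = (s->rsp_mod16 + 8) % 16; } /* rsp -= 8 */
static void pop8(StackModel *s)  { s->rsp_mod16 = (s->rsp_mod16 + 8) % 16; } /* rsp += 8 */
static int  call_aligned(const StackModel *s) { return s->rsp_mod16 == 0; }

/* Models the sequence: push rdi; call; pop; (second call?).
 * Returns 1 if the first call is aligned AND a second call after the pop
 * would be misaligned, i.e. the pop-between-calls scheme breaks alignment. */
int pop_between_calls_breaks_alignment(void) {
    StackModel s = { 8 };               /* state at bar()'s entry */
    push8(&s);                          /* push rdi */
    int first_ok = call_aligned(&s);    /* call foo: rsp % 16 == 0, fine */
    pop8(&s);                           /* pop after the first call */
    int second_ok = call_aligned(&s);   /* another call here: rsp % 16 == 8 */
    return first_ok && !second_ok;
}
```

This is why the pop-right-after-the-call trick only works cleanly when a single push/pop pair wraps the entire call sequence.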
(Function pointers and shared-library boundaries can defeat this.)

I think one reason compilers haven't bothered to try these optimizations is that it would require a whole bunch of different code in the compiler internals, different from the normal stack-vs.-register-allocation code that knows how to save call-preserved registers and use them. That is, it would be a lot of work to implement and a lot of code to maintain, and if it got over-enthusiastic about doing this, it could make worse code.

And also, it's (hopefully) not significant: if it matters, you should be inlining bar into its caller, or inlining foo into bar. This is fine unless there are a lot of different bar-like functions and foo is large, and for some reason they can't be inlined into their callers.
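To close with an illustration of the implicit-memory-barrier point above, here is a sketch (all names are mine; opaque_call stands in for a callee whose body the compiler can't see, so it's defined here only to make the example runnable): a global must genuinely be stored before the call and reloaded after it, while a local whose address never escapes can stay in a register the whole time.

```c
#include <assert.h>

long global_counter = 0;   /* globally reachable: must be in sync at each call */

/* Stand-in for an opaque callee; in real code the compiler couldn't see this
 * body and would have to assume it reads/writes any global. */
void opaque_call(void) { global_counter += 1; }

long demo(long x) {
    long local = x * 2;        /* address never escapes: escape analysis lets
                                  it live in a register across the call */
    global_counter = local;    /* must actually be stored before the call */
    opaque_call();             /* may read or write any global */
    return local + global_counter;  /* global must be reloaded afterwards */
}
```

With the stand-in callee above, demo(5) stores 10 to the global, the call bumps it to 11, and the reloaded value plus the register-resident local gives 21.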