This article looks at the question "Why does introducing useless MOV instructions speed up a tight loop in x86_64 assembly?" and its recommended answer, which should be a useful reference for anyone who runs into the same behavior.

Problem description


Background:

While optimizing some Pascal code with embedded assembly language, I noticed an unnecessary MOV instruction, and removed it.

To my surprise, removing the unnecessary instruction caused my program to slow down.

I found that adding arbitrary, useless MOV instructions increased performance even further.

The effect is erratic, and changes based on execution order: the same junk instructions transposed up or down by a single line produce a slowdown.

I understand that the CPU does all kinds of optimizations and streamlining, but, this seems more like black magic.

Data:

A version of my code conditionally compiles three junk operations in the middle of a loop that runs 2**20==1048576 times. (The surrounding program just calculates SHA-256 hashes).

The results on my rather old machine (Intel(R) Core(TM)2 CPU 6400 @ 2.13 GHz):

avg time (ms) with -dJUNKOPS: 1822.84 ms
avg time (ms) without:        1836.44 ms

The programs were run 25 times in a loop, with the run order changing randomly each time.
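
A minimal timing harness along these lines might look like the sketch below. It is only an illustration: the two stand-in procedures and their names are assumptions made here for the example (the real benchmark switched variants with the -dJUNKOPS conditional define and two separate builds).

program junkop_bench;
{$mode objfpc}

uses SysUtils;   // for GetTickCount64

type
  TTestProc = procedure;

// Stand-ins for the two builds of the SHA-256 loop (bodies omitted here).
procedure sha256_with_junkops;    begin { loop built with -dJUNKOPS }    end;
procedure sha256_without_junkops; begin { loop built without -dJUNKOPS } end;

// Time a single run of one variant, in milliseconds.
function TimeOnce(p: TTestProc): Int64;
var startMs: QWord;
begin
  startMs := GetTickCount64;
  p();
  Result := Int64(GetTickCount64 - startMs);
end;

var
  i: Integer;
  withJunk, withoutJunk: Int64;
begin
  Randomize;
  withJunk := 0;
  withoutJunk := 0;
  for i := 1 to 25 do
  begin
    // Randomize which variant goes first each round so that warm-up and
    // cache effects average out over both variants.
    if Random(2) = 0 then
    begin
      Inc(withJunk, TimeOnce(@sha256_with_junkops));
      Inc(withoutJunk, TimeOnce(@sha256_without_junkops));
    end
    else
    begin
      Inc(withoutJunk, TimeOnce(@sha256_without_junkops));
      Inc(withJunk, TimeOnce(@sha256_with_junkops));
    end;
  end;
  WriteLn('avg time (ms) with -dJUNKOPS: ', withJunk / 25 : 0 : 2);
  WriteLn('avg time (ms) without:        ', withoutJunk / 25 : 0 : 2);
end.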

Code excerpt:

{$asmmode intel}
procedure example_junkop_in_sha256;
  var s1, t2 : uint32;
  begin
    // Here are parts of the SHA-256 algorithm, in Pascal:
    // s0 {r10d} := ror(a, 2) xor ror(a, 13) xor ror(a, 22)
    // s1 {r11d} := ror(e, 6) xor ror(e, 11) xor ror(e, 25)
    // Here is how I translated them (side by side to show symmetry):
  asm
    MOV r8d, a                 ; MOV r9d, e
    ROR r8d, 2                 ; ROR r9d, 6
    MOV r10d, r8d              ; MOV r11d, r9d
    ROR r8d, 11    {13 total}  ; ROR r9d, 5     {11 total}
    XOR r10d, r8d              ; XOR r11d, r9d
    ROR r8d, 9     {22 total}  ; ROR r9d, 14    {25 total}
    XOR r10d, r8d              ; XOR r11d, r9d

    // Here is the extraneous operation that I removed, causing a speedup
    // s1 is the uint32 variable declared at the start of the Pascal code.
    //
    // I had cleaned up the code, so I no longer needed this variable, and
    // could just leave the value sitting in the r11d register until I needed
    // it again later.
    //
    // Since copying to RAM seemed like a waste, I removed the instruction,
    // only to discover that the code ran slower without it.
    {$IFDEF JUNKOPS}
    MOV s1,  r11d
    {$ENDIF}

    // The next part of the code just moves on to another part of SHA-256,
    // maj { r12d } := (a and b) xor (a and c) xor (b and c)
    mov r8d,  a
    mov r9d,  b
    mov r13d, r9d // Set aside a copy of b
    and r9d,  r8d

    mov r12d, c
    and r8d, r12d  { a and c }
    xor r9d, r8d

    and r12d, r13d { c and b }
    xor r12d, r9d

    // Copying the calculated value to the same s1 variable is another speedup.
    // As far as I can tell, it doesn't actually matter what register is copied,
    // but moving this line up or down makes a huge difference.
    {$IFDEF JUNKOPS}
    MOV s1,  r9d // after mov r12d, c
    {$ENDIF}

    // And here is where the two calculated values above are actually used:
    // T2 {r12d} := S0 {r10d} + Maj {r12d};
    ADD r12d, r10d
    MOV T2, r12d

  end
end;

Try it yourself:

The code is online at GitHub if you want to try it out yourself.

My questions:

  • Why would uselessly copying a register's contents to RAM ever increase performance?

  • Why would the same useless instruction provide a speedup on some lines, and a slowdown on others?

  • Is this behavior something that could be exploited predictably by a compiler?

    Recommended answer

    The most likely cause of the speed improvement is that:


    • inserting a MOV shifts the subsequent instructions to different memory addresses

    • one of those moved instructions was an important conditional branch

    • that branch was being incorrectly predicted because of aliasing in the branch-prediction table

    • moving the branch eliminated the alias and allowed the branch to be predicted correctly

    Your Core2 doesn't keep a separate history record for each conditional jump. Instead it keeps a shared history of all conditional jumps. One disadvantage of global branch prediction is that the history is diluted by irrelevant information if the different conditional jumps are uncorrelated.

    This little branch prediction tutorial shows how branch prediction buffers work. The prediction buffer is indexed by the lower portion of the address of the branch instruction. This works well unless two important, uncorrelated branches share the same lower bits. In that case you end up with aliasing, which causes many mispredicted branches (and each misprediction stalls the instruction pipeline and slows your program).
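
    To make the aliasing concrete, below is a small Free Pascal sketch of such a prediction buffer: a direct-mapped table of 2-bit saturating counters indexed by the low bits of the branch address. The table size, the index function, and the two example addresses are illustrative assumptions, not a description of the Core2's real predictor.

    program bht_aliasing_sketch;
    {$mode objfpc}

    const
      TABLE_BITS = 12;                           // assume a 4096-entry table
      TABLE_SIZE = 1 shl TABLE_BITS;

    var
      counter: array[0..TABLE_SIZE - 1] of Byte; // 2-bit saturating counters (0..3)

    // The table is indexed by the low bits of the branch instruction's address.
    function SlotFor(addr: QWord): Integer;
    begin
      Result := Integer(addr and (TABLE_SIZE - 1));
    end;

    // Predict "taken" when the counter is in the upper half of its range.
    function PredictTaken(addr: QWord): Boolean;
    begin
      Result := counter[SlotFor(addr)] >= 2;
    end;

    // Train the counter with the branch's actual outcome.
    procedure Train(addr: QWord; taken: Boolean);
    var s: Integer;
    begin
      s := SlotFor(addr);
      if taken and (counter[s] < 3) then Inc(counter[s]);
      if (not taken) and (counter[s] > 0) then Dec(counter[s]);
    end;

    // Two uncorrelated branches: A is always taken, B is never taken.
    // Returns the number of mispredictions over 1000 iterations of both.
    function Mispredictions(addrA, addrB: QWord): Integer;
    var i: Integer;
    begin
      FillChar(counter, SizeOf(counter), 0);
      Result := 0;
      for i := 1 to 1000 do
      begin
        if not PredictTaken(addrA) then Inc(Result);  // A is actually taken
        Train(addrA, True);
        if PredictTaken(addrB) then Inc(Result);      // B is actually not taken
        Train(addrB, False);
      end;
    end;

    begin
      // The two addresses share their low 12 bits, so both branches keep
      // retraining the same counter and branch A is mispredicted every time.
      WriteLn('aliased    : ', Mispredictions($0040A120, $0040B120));
      // Inserting an instruction (such as the junk MOV) ahead of branch B
      // shifts its address by a few bytes into a different slot, and the
      // aliasing (and almost all of the mispredictions) disappears.
      WriteLn('not aliased: ', Mispredictions($0040A120, $0040B124));
    end.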

    If you want to understand how branch mispredictions affect performance, take a look at this excellent answer: http://stackoverflow.com/a/11227902/1001643

    Compilers typically don't have enough information to know which branches will alias and whether those aliases will be significant. However, that information can be determined at runtime with tools such as Cachegrind and VTune.
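
    For example, Cachegrind's branch-prediction simulation can be switched on from the command line (the program name below is a placeholder):

        valgrind --tool=cachegrind --branch-sim=yes ./yourprogram

    With that option the summary reports how many conditional branches were executed and how many were mispredicted, which makes it straightforward to compare the -dJUNKOPS and plain builds.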

    This concludes the article on why introducing useless MOV instructions speeds up a tight loop in x86_64 assembly. Hopefully the recommended answer above is helpful.
