Can the simple decoders in recent Intel microarchitectures handle all 1-µop instructions?

Problem description

The front end of recent Intel CPUs contains one complex decoder and a number of simple decoders. The complex decoder can handle instructions that decode to multiple µops, whereas the simple decoders support only instructions that decode to a single (fused-domain) µop.
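
For concreteness, a few examples of each kind, with uop counts as given in Agner Fog's instruction tables for Skylake:

    add  eax, ecx       ; 1 fused-domain uop: any decoder can handle it
    mov  edx, [rsi]     ; 1 fused-domain uop (load): any decoder
    xchg eax, ecx       ; 3 uops: complex decoder only
    rep movsb           ; microcoded: complex decoder + MS-ROM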

Can all 1-µop instructions be decoded by the simple decoders, or are there 1-µop instructions that can only be handled by the complex decoder?

Answer

No, there are some instructions that can only decode 1/clock

Andreas's comments indicate that xor eax,eax / setnle al seems to have a decode bottleneck of 1/clock. I found the same thing with cdq: Reads EAX, writes EDX, also demonstrably runs faster from the DSB (uop cache), and doesn't involve partial-registers or anything at all weird, and doesn't need a dep-breaking instruction.
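
(A minimal sketch of that kind of test loop; this is a reconstruction, not Andreas's actual harness. Each xor/setnle pair is only 5 bytes and 2 uops, so no single 32-byte window overflows the uop cache the way times 20 cdq does; one way to force legacy decode is an unroll large enough to overflow the DSB's total capacity, about 1.5K uops on Skylake.)

align 64
.loop:
%rep 800                ; ~1600 uops: assumed large enough to bust the whole DSB
    xor    eax, eax     ; dep-breaking zeroing idiom, 1 uop
    setnle al           ; 1 fused-domain uop, but observed to decode at 1/clock
%endrep
    dec    ebp
    jnz    .loop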

Even better, being a single-byte instruction it can defeat the DSB with only a short block of instructions. (Leading to misleading results from testing on some CPUs, e.g. in Agner Fog's tables and on https://uops.info/, e.g. SKX shown as 1c throughput.) https://www.uops.info/html-tp/SKX/CDQ-Measurements.html vs. https://www.uops.info/html-tp/CFL/CDQ-Measurements.html have inconsistent throughputs because of different testing methods: only the Coffee Lake test ever tested with a small enough unroll count (10) to not bust the DSB, finding a throughput of 0.6. (The actual throughput is 0.5 once you account for loop overhead, fully explained by back-end port pressure same as cqo. IDK why you'd find 0.6 instead of 0.55 with only one extra uop for p6 in the loop.)
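
(Where the 0.55 comes from, assuming dec/jnz macro-fuses into a single uop: each iteration issues 10 cdq uops that can run on p0 or p6, plus 1 dec/jnz uop that needs p6, so 11 uops compete for 2 ports and a back-end-bound iteration takes at least 11/2 = 5.5 cycles, i.e. 0.55c per cdq.)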

(Zen can run this instruction with 0.25c throughput; no weird decode problems, and it's handled by every integer-ALU port.)

times 10 cdq in a dec/jnz loop can run from the uop cache, and runs at 0.5c throughput on Skylake (p06), plus loop overhead which also competes for p6.

times 20 cdq is more than 3 uop cache lines for one 32-byte block of machine code, meaning the loop can only run from legacy decode (with the top of the loop aligned). On Skylake this runs at 1 cycle per cdq. Perf counters confirm MITE delivers 1 uop per cycle, rather than groups of 3 or 4 with idle cycles between.
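
(The uop-cache arithmetic behind that: on SnB-family an aligned 32-byte block can map to at most 3 uop-cache lines of up to 6 uops each, i.e. 18 uops. The 32-byte window at the top of this loop holds 20 cdq uops plus the macro-fused dec/jnz, 21 uops in total, so the DSB can't cache it and fetch falls back to MITE.)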

default rel
%ifdef __YASM_VER__
    CPU Skylake AMD
%else
%use smartalign
alignmode p6, 64
%endif

global _start
_start:
    mov  ebp, 1000000000

align 64
.loop:
    ;times 10 cdq   ; 0.5c throughput
    ;times 20 cdq   ; 1c throughput, 1 MITE uop per cycle front-end

    ; times 10 cqo        ; 0.5c throughput 2-byte insn fits uop cache
    ; times 10 cdqe       ; 1c throughput data dependency
    ;times 10 cld         ; ~4c throughput, 3 uops

    dec ebp
    jnz .loop
.end:

    xor edi,edi
    mov eax,231   ; __NR_exit_group  from /usr/include/asm/unistd_64.h
    syscall       ; sys_exit_group(0)

On my Arch Linux desktop, I built this into a static executable to run under perf:

  • i7-6700k with epp=balance_performance (max "turbo" = 3.9GHz)
  • microcode revision 0xd6 (so the LSD is disabled; not that it matters: a loop can only run from the LSD loop buffer if all its uops are in the DSB uop cache, IIRC.)

In a bash shell:

t=cdq-latency; nasm -f elf64 "$t".asm && ld -o "$t" "$t.o" && objdump -drwC -Mintel "$t" && taskset -c 3 perf stat --all-user -etask-clock,context-switches,cpu-migrations,page-faults,cycles,instructions,uops_issued.any,frontend_retired.dsb_miss,idq.dsb_uops,idq.mite_uops,idq.mite_cycles,idq_uops_not_delivered.core,idq_uops_not_delivered.cycles_fe_was_ok,idq.all_mite_cycles_4_uops ./"$t"

Disassembly:

0000000000401000 <_start>:
  401000:       bd 00 ca 9a 3b          mov    ebp,0x3b9aca00
  401005:       0f 1f 84 00 00 00 00 00         nop    DWORD PTR [rax+rax*1+0x0]
...
  40103d:       0f 1f 00                nop    DWORD PTR [rax]

0000000000401040 <_start.loop>:
  401040:       99                      cdq
  401041:       99                      cdq
  401042:       99                      cdq
  401043:       99                      cdq
...
  401052:       99                      cdq
  401053:       99                      cdq             # 20 total CDQ
  401054:       ff cd                   dec    ebp
  401056:       75 e8                   jne    401040 <_start.loop>

0000000000401058 <_start.end>:
  401058:       31 ff                   xor    edi,edi
  40105a:       b8 e7 00 00 00          mov    eax,0xe7
  40105f:       0f 05                   syscall

Performance results:

 Performance counter stats for './cdq-latency':

          5,205.44 msec task-clock                #    1.000 CPUs utilized
                 0      context-switches          #    0.000 K/sec
                 0      cpu-migrations            #    0.000 K/sec
                 1      page-faults               #    0.000 K/sec
    20,124,711,776      cycles                    #    3.866 GHz                      (49.88%)
    22,015,118,295      instructions              #    1.09  insn per cycle           (59.91%)
    21,004,212,389      uops_issued.any           # 4035.049 M/sec                    (59.97%)
     1,005,872,141      frontend_retired.dsb_miss #  193.235 M/sec                    (60.03%)
                 0      idq.dsb_uops              #    0.000 K/sec                    (60.08%)
    20,997,157,414      idq.mite_uops             # 4033.694 M/sec                    (60.12%)
    19,996,447,738      idq.mite_cycles           # 3841.451 M/sec                    (40.03%)
    59,048,559,790      idq_uops_not_delivered.core # 11343.621 M/sec                   (39.97%)
       112,956,733      idq_uops_not_delivered.cycles_fe_was_ok #   21.700 M/sec                    (39.92%)
           209,490      idq.all_mite_cycles_4_uops #    0.040 M/sec                    (39.88%)

       5.206491348 seconds time elapsed

So the loop overhead (dec/jnz) happened basically for free, decoding in the same cycle as the last cdq. Counts are not exact because I used too many events in one run (with HT enabled), so perf did software multiplexing. From another run with fewer counters:
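
A command along these lines; this is a reconstruction to match the counters shown below, not the exact invocation (cycles uses a fixed counter, so only three programmable counters are needed and perf doesn't have to multiplex):

taskset -c 3 perf stat --all-user -etask-clock,cycles,idq.mite_cycles,idq_uops_not_delivered.core,idq_uops_not_delivered.cycles_fe_was_ok ./cdq-latency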

# same source, only these HW counters enabled to avoid multiplexing
          5,161.14 msec task-clock                #    1.000 CPUs utilized

    20,107,065,550      cycles                    #    3.896 GHz
    20,000,134,955      idq.mite_cycles           # 3875.142 M/sec
    59,050,860,720      idq_uops_not_delivered.core # 11441.447 M/sec
        95,968,317      idq_uops_not_delivered.cycles_fe_was_ok #   18.594 M/sec

So we can see that MITE (legacy decode) was active basically every cycle, and that the front-end was basically never "ok" (i.e. never just stalled on the back-end).
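
(Sanity-checking those numbers: 59.05 billion undelivered slots / 20.1 billion cycles ≈ 2.94 unused issue slots per cycle, i.e. roughly 1 of the 4 issue slots filled each cycle, matching MITE delivering 1 uop per cycle.)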

With only 10 CDQ instructions, letting the DSB work:

...
0000000000401040 <_start.loop>:
  401040:       99                      cdq
  401041:       99                      cdq
...
  401049:       99                      cdq        # 10 total CDQ insns
  40104a:       ff cd                   dec    ebp
  40104c:       75 f2                   jne    401040 <_start.loop>

 Performance counter stats for './cdq-latency' (4 runs):

          1,417.38 msec task-clock                #    1.000 CPUs utilized            ( +-  0.03% )
                 0      context-switches          #    0.000 K/sec
                 0      cpu-migrations            #    0.000 K/sec
                 1      page-faults               #    0.001 K/sec
     5,511,283,047      cycles                    #    3.888 GHz                      ( +-  0.03% )  (49.83%)
    11,997,247,694      instructions              #    2.18  insn per cycle           ( +-  0.00% )  (59.99%)
    10,999,182,841      uops_issued.any           # 7760.224 M/sec                    ( +-  0.00% )  (60.17%)
           197,753      frontend_retired.dsb_miss #    0.140 M/sec                    ( +- 13.62% )  (60.21%)
    10,988,958,908      idq.dsb_uops              # 7753.010 M/sec                    ( +-  0.03% )  (60.21%)
        10,234,859      idq.mite_uops             #    7.221 M/sec                    ( +- 27.43% )  (60.21%)
         8,114,909      idq.mite_cycles           #    5.725 M/sec                    ( +- 26.11% )  (39.83%)
        40,588,332      idq_uops_not_delivered.core #   28.636 M/sec                    ( +- 21.83% )  (39.79%)
     5,502,581,002      idq_uops_not_delivered.cycles_fe_was_ok # 3882.221 M/sec                    ( +-  0.01% )  (39.79%)
            56,223      idq.all_mite_cycles_4_uops #    0.040 M/sec                    ( +-  3.32% )  (39.79%)

          1.417599 +- 0.000489 seconds time elapsed  ( +-  0.03% )

As reported by idq_uops_not_delivered.cycles_fe_was_ok, basically all the unused front-end uop slots were the fault of the back-end (port pressure on p0 / p6), not the front-end.
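
(Checking the measurement against the same port math: 5,511,283,047 cycles / 1e9 iterations ≈ 5.51 cycles per iteration, essentially the 5.5-cycle floor from 11 p06 uops per iteration across 2 ports, i.e. ≈0.55c per cdq.)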
