Are write-combining buffers used for normal writes to WB memory regions on Intel?

Problem description


Write-combining buffers have been a feature of Intel CPUs going back to at least the Pentium 4 and probably before. The basic idea is that these cache-line sized buffers collect writes to the same cache line so they can be handled as a unit. As an example of their implications for software performance, if you don't write the full cache line, you may experience reduced performance.

For example, section "3.6.10 Write Combining" of the Intel 64 and IA-32 Architectures Optimization Reference Manual describes this mechanism.

My question is whether write combining applies to WB memory regions (that's the "normal" memory you are using 99.99% of the time in user programs), when using normal stores (that's anything other than non-temporal stores, i.e., the stores you are using 99.99% of the time).
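For concreteness, the difference is purely which store instruction is used; a two-line NASM illustration (added here for clarity, not from the manual):

    mov    [rdi], rax   ; normal store: goes through the caches, honoring the memory type (WB here)
    movnti [rdi], rax   ; non-temporal store: bypasses the caches via write combining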

The manual's text is hard to interpret exactly, and it seems not to have been updated since the Core Duo era. It says that write combining "applies to WC memory but not UC", but of course that leaves out all the other types, like WB. Later it says that "[WC is] particularly important for writes to uncached memory", seemingly contradicting the "doesn't apply to UC" part.

So are write combining buffers used on modern Intel chips for normal stores to WB memory?

Solution

Yes, the write combining and coalescing properties of the LFBs support all memory types except the UC type. You can observe their impact experimentally using the following program. It takes two parameters as input:

  • STORE_COUNT: the number of 8-byte stores to perform sequentially.
  • INCREMENT: the stride, in bytes, between consecutive stores.

There are 4 different values of INCREMENT that are particularly interesting:

  • 64: All stores are performed on unique cache lines. Write combining and coalescing will not take effect.
  • 0: All stores are to the same cache line and the same location within that line. Write coalescing takes effect in this case.
  • 8: Every 8 consecutive stores are to the same cache line, but to different locations within that line. Write combining takes effect in this case.
  • 4: The target locations of consecutive stores overlap within the same cache line. Some stores might cross two cache lines (depending on STORE_COUNT). Both write combining and coalescing will take effect.

There is another parameter, ITERATIONS, which is used to repeat the same experiment many times to make reliable measurements. You can keep it at 1000.

%define ITERATIONS 1000
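; STORE_COUNT and INCREMENT are not defined in this file; they are expected
; to be passed on the assembler command line, e.g.:
; nasm -D STORE_COUNT=16 -D INCREMENT=64 ...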

BITS 64
DEFAULT REL

section .bss
align 64
bufsrc:     resb STORE_COUNT*64

section .text
global _start
_start:  
    mov ecx, ITERATIONS

.loop:
; Flush all the cache lines to make sure that it takes a substantial amount of time to fetch them.
    lea rsi, [bufsrc]
    mov edx, STORE_COUNT
.flush:
    clflush [rsi]
    sfence
    lfence
    add rsi, 64
    sub edx, 1
    jnz .flush

; This is the main loop where the stores are issued sequentially.
    lea rsi, [bufsrc]
    mov edx, STORE_COUNT
.inner:
    mov [rsi], rdx
    sfence ; Prevents potential combining in the store buffer.
    add rsi, INCREMENT
    sub edx, 1
    jnz .inner

; Spend some time doing nothing so that all the LFBs become free for the next iteration.
    mov edx, 100000
.wait:
    lfence
    sub edx, 1
    jnz .wait

    sub ecx, 1
    jnz .loop

; Exit.    
    xor edi,edi
    mov eax,231
    syscall
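
For reference, a build-and-measure invocation might look like the following. This is a sketch, not part of the original answer: the file name wc.asm and the parameter values are placeholders, and l1d_pend_miss.fb_full is the name under which Linux perf exposes this counter on CPUs that support it:

nasm -f elf64 -D STORE_COUNT=16 -D INCREMENT=64 -o wc.o wc.asm
ld -o wc wc.o
sudo perf stat -e l1d_pend_miss.fb_full ./wc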

I recommend the following setup:

  • Disable all hardware prefetchers using sudo wrmsr -a 0x1A4 0xf. This ensures that they will not interfere (or will interfere only minimally) with the experiments.
  • Set the CPU frequency to the maximum. This increases the probability that the main loop will be fully executed before the first cache line reaches the L1 and causes an LFB to be freed.
  • Disable hyperthreading because the LFBs are shared (at least since Sandy Bridge, but not on all microarchitectures). A sketch of these commands follows this list.
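
A minimal sketch of this setup on Linux, assuming the msr-tools and cpupower utilities are installed and a kernel recent enough to expose the SMT control file (apart from the wrmsr line, these commands are assumptions, not from the original answer):

sudo modprobe msr                                        # expose MSRs via /dev/cpu/*/msr
sudo wrmsr -a 0x1A4 0xf                                  # disable all four hardware prefetchers
sudo cpupower frequency-set -g performance               # run the cores at maximum frequency
echo off | sudo tee /sys/devices/system/cpu/smt/control  # disable hyperthreading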

The L1D_PEND_MISS.FB_FULL performance counter enables us to capture the effect of write combining in terms of how it impacts the availability of LFBs. It is supported on Intel Core and later, and it counts the number of times a request needed a fill-buffer entry but no entry was available.

First run the code without the inner loop and make sure that L1D_PEND_MISS.FB_FULL is zero, which means that the flush loop has no impact on the event count.

The following figure plots STORE_COUNT against total L1D_PEND_MISS.FB_FULL divided by ITERATIONS.

We can observe the following:

  • It's clear that there are exactly 10 LFBs.
  • When write combining or coalescing is possible, L1D_PEND_MISS.FB_FULL is zero for any number of stores.
  • When the stride is 64 bytes, L1D_PEND_MISS.FB_FULL becomes larger than zero once the number of stores exceeds 10.

Both WC and UC are classified as uncacheable, so you can put the two statements together and deduce that write combining is particularly important for writes to WC memory; the manual's "uncached memory" should be read as WC rather than UC.

See also: Where is the Write-Combining Buffer located? x86.
