This article describes poor memcpy performance in user space on physical memory mmap'ed under Linux and how the cause was tracked down; it may serve as a useful reference for readers facing the same problem.

Problem description

Of the 192GB RAM installed in my computer, the 188GB above 4GB (at hardware address 0x100000000) are reserved by the Linux kernel at boot time (mem=4G memmap=188G$4G). A data acquisition kernel module accumulates data into this large area, used as a ring buffer, via DMA. A user space application mmaps this ring buffer into user space and then copies blocks from the current location in the ring buffer for processing once they are ready.

Copying these 16MB blocks from the mmap'ed area using memcpy does not perform as I expected. It appears that the performance depends on the size of the memory reserved at boot time (and later mmap'ed into user space). http://www.wurmsdobler.org/files/resmem.zip contains the source code of a kernel module which implements the mmap file operation:

/* Module parameters: physical start address and length of the reserved region. */
module_param(resmem_hwaddr, ulong, S_IRUSR);
module_param(resmem_length, ulong, S_IRUSR);
//...
static int resmem_mmap(struct file *filp, struct vm_area_struct *vma) {
    /* Map the reserved physical range straight into the caller's address space. */
    if (remap_pfn_range(vma, vma->vm_start,
            resmem_hwaddr >> PAGE_SHIFT,
            resmem_length, vma->vm_page_prot))
        return -EAGAIN;
    return 0;
}

and a test application which, in essence (with the checks removed), does:

#define BLOCKSIZE ((size_t)16*1024*1024)
int resMemFd = ::open(RESMEM_DEV, O_RDWR | O_SYNC);
unsigned long resMemLength = 0;
::ioctl(resMemFd, RESMEM_IOC_LENGTH, &resMemLength);   // ask the module how large the reserved region is
void* resMemBase = ::mmap(0, resMemLength, PROT_READ | PROT_WRITE, MAP_SHARED, resMemFd, 4096);
char* source = ((char*)resMemBase) + RESMEM_HEADER_SIZE;
char* destination = new char[BLOCKSIZE];
struct timeval start, end;
gettimeofday(&start, NULL);
memcpy(destination, source, BLOCKSIZE);                // the copy being measured
gettimeofday(&end, NULL);
float time = (end.tv_sec - start.tv_sec)*1000.0f + (end.tv_usec - start.tv_usec)/1000.0f;
std::cout << "memcpy from mmap'ed to malloc'ed: " << time << "ms (" << BLOCKSIZE/1000.0f/time << "MB/s)" << std::endl;
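
For reference, here is a self-contained sketch of the same timing harness with the checks put back in and the copy run twice, so that cold and warm timings can be compared. RESMEM_DEV, RESMEM_IOC_LENGTH and RESMEM_HEADER_SIZE are defined in the module's header in resmem.zip; the values below are placeholders of my own and may well differ from the real ones.

// Sketch only: self-contained variant of the timing code above.
// RESMEM_DEV, RESMEM_IOC_LENGTH and RESMEM_HEADER_SIZE are placeholder
// definitions; the real ones come from the resmem kernel module's header.
#include <fcntl.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/time.h>
#include <unistd.h>
#include <cstdio>
#include <cstring>
#include <iostream>

#define RESMEM_DEV         "/dev/resmem"                  // placeholder device node
#define RESMEM_IOC_LENGTH  _IOR('r', 1, unsigned long)    // placeholder ioctl number
#define RESMEM_HEADER_SIZE 4096                           // placeholder header size
#define BLOCKSIZE          ((size_t)16*1024*1024)

int main() {
    int resMemFd = ::open(RESMEM_DEV, O_RDWR | O_SYNC);
    if (resMemFd < 0) { perror("open"); return 1; }

    unsigned long resMemLength = 0;
    if (::ioctl(resMemFd, RESMEM_IOC_LENGTH, &resMemLength) < 0) { perror("ioctl"); return 1; }

    void* resMemBase = ::mmap(0, resMemLength, PROT_READ | PROT_WRITE, MAP_SHARED, resMemFd, 4096);
    if (resMemBase == MAP_FAILED) { perror("mmap"); return 1; }

    char* source = ((char*)resMemBase) + RESMEM_HEADER_SIZE;
    char* destination = new char[BLOCKSIZE];

    for (int run = 1; run <= 2; ++run) {                  // run 1 is cold, run 2 is warm
        struct timeval start, end;
        gettimeofday(&start, NULL);
        memcpy(destination, source, BLOCKSIZE);
        gettimeofday(&end, NULL);
        float time = (end.tv_sec - start.tv_sec)*1000.0f + (end.tv_usec - start.tv_usec)/1000.0f;
        std::cout << "run " << run << ": " << time << "ms ("
                  << BLOCKSIZE/1000.0f/time << "MB/s)" << std::endl;
    }

    delete[] destination;
    ::munmap(resMemBase, resMemLength);
    ::close(resMemFd);
    return 0;
}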

I have carried out memcpy tests of a 16MB data block for the different sizes of reserved RAM (resmem_length) on Ubuntu 10.04.4, Linux 2.6.32, on a SuperMicro 1026GT-TF-FM109:

| reserved RAM |          1GB          |          4GB           |          16GB          |          64GB          |          128GB          |          188GB          |
| run 1        | 9.274ms (1809.06MB/s) | 11.503ms (1458.51MB/s) | 11.333ms (1480.39MB/s) |  9.326ms (1798.97MB/s) | 213.892ms (  78.43MB/s) | 206.476ms (  81.25MB/s) |
| run 2        | 4.255ms (3942.94MB/s) |  4.249ms (3948.51MB/s) |  4.257ms (3941.09MB/s) |  4.298ms (3903.49MB/s) | 208.269ms (  80.55MB/s) | 200.627ms (  83.62MB/s) |

My observations are:

  1. From the first run to the second, memcpy from the mmap'ed to the malloc'ed buffer seems to benefit from the contents already being cached somewhere.

  2. There is a significant performance degradation from >64GB of reserved RAM onwards, which can be noticed when using memcpy.

I would like to understand why that is so. Perhaps somebody in the Linux kernel developer group thought: 64GB should be enough for anybody (does this ring a bell?)

Kind regards, Peter

Answer

Based on feedback from SuperMicro, the performance degradation is due to NUMA, non-uniform memory access. The SuperMicro 1026GT-TF-FM109 uses the X8DTG-DF motherboard with one Intel 5520 Tylersburg chipset at its heart, connected to two Intel Xeon E5620 CPUs, each of which has 96GB RAM attached.
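
The local-versus-remote effect can also be reproduced without the reserved-memory module. Below is a minimal sketch (not the code used for the figures in this answer) that uses libnuma to time a 16MB memcpy from a buffer placed on each NUMA node while the copying thread stays on node 0; build with g++ -O2 -lnuma.

// Sketch only: compare memcpy throughput from node-local vs. remote memory.
#include <numa.h>
#include <sys/time.h>
#include <cstring>
#include <cstdio>

static const size_t BLOCKSIZE = (size_t)16*1024*1024;

static float copy_ms(const char* src, char* dst) {
    struct timeval start, end;
    gettimeofday(&start, NULL);
    memcpy(dst, src, BLOCKSIZE);
    gettimeofday(&end, NULL);
    return (end.tv_sec - start.tv_sec)*1000.0f + (end.tv_usec - start.tv_usec)/1000.0f;
}

int main() {
    if (numa_available() < 0) { fprintf(stderr, "no NUMA support\n"); return 1; }
    if (numa_run_on_node(0) != 0) { perror("numa_run_on_node"); return 1; }   // stay on node 0

    char* dst = (char*)numa_alloc_onnode(BLOCKSIZE, 0);          // destination is node-local

    for (int node = 0; node <= numa_max_node(); ++node) {
        char* src = (char*)numa_alloc_onnode(BLOCKSIZE, node);
        memset(src, 0xA5, BLOCKSIZE);                            // touch pages so they are placed
        copy_ms(src, dst);                                       // warm-up run
        float ms = copy_ms(src, dst);
        printf("node %d -> node 0: %.3fms (%.1fMB/s)\n", node, ms, BLOCKSIZE/1000.0f/ms);
        numa_free(src, BLOCKSIZE);
    }
    numa_free(dst, BLOCKSIZE);
    return 0;
}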

If I lock my application to CPU0, I can observe different memcpy speeds depending on what memory area was reserved and consequently mmap'ed. If the reserved memory area is off-CPU, then mmap struggles for some time to do its work, and any subsequent memcpy to and from the "remote" area consumes more time (data block size = 16MB):

resmem=64G$4G   (inside CPU0 realm):   3949MB/s  
resmem=64G$96G  (outside CPU0 realm):    82MB/s  
resmem=64G$128G (outside CPU0 realm):  3948MB/s
resmem=92G$4G   (inside CPU0 realm):   3966MB/s            
resmem=92G$100G (outside CPU0 realm):    57MB/s   

It nearly makes sense. Only the third case, 64G$128G, i.e. the uppermost 64GB, also yields good results, which somewhat contradicts the theory.
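
As for the "locking to CPU0" step above: it can be done from the outside (e.g. taskset -c 0 or numactl --cpunodebind=0) or inside the program. The following is a minimal in-process sketch, not necessarily the method used for the figures above.

// Sketch: pin the calling process to CPU 0 (on the first socket / NUMA node 0
// in this machine) before running the memcpy test.
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <sched.h>
#include <cstdio>

static bool pin_to_cpu0() {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);                                    // restrict to CPU 0 only
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {  // pid 0 = calling process
        perror("sched_setaffinity");
        return false;
    }
    return true;
}

int main() {
    if (!pin_to_cpu0())
        return 1;
    // ... run the memcpy benchmark from the question here ...
    return 0;
}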

Regards, Peter
