Multi-Threaded Performance When Multiplying 2 Arrays / Images - Intel IPP

Question

I'm using Intel IPP to multiply two images (arrays).
I'm using Intel IPP 8.2, which ships with Intel Composer 2015 Update 6.

I created a simple function to multiply two large images (the whole project is attached, see below).
I wanted to see the gains from using the Intel IPP multi-threaded library.

Here is the simple project (I also attached the complete Visual Studio project):

#include "ippi.h"
#include "ippcore.h"
#include "ipps.h"
#include "ippcv.h"
#include "ippcc.h"
#include "ippvm.h"

#include <ctime>
#include <iostream>

using namespace std;

const int height = 6000;
const int width  = 6000;
Ipp32f mInput_image [1 * width * height];
Ipp32f mOutput_image[1 * width * height] = {0};

int main()
{
    IppiSize size = {width, height};

    double start = clock();

    for (int i = 0; i < 200; i++)
        ippiMul_32f_C1R(mInput_image, 6000 * 4, mInput_image, 6000 * 4, mOutput_image, 6000 * 4, size); 

    double end = clock();
    double duration = (end - start) / static_cast<double>(CLOCKS_PER_SEC);

    cout << duration << endl;
    cin.get();

    return 0;
}

I compiled this project once with the single-threaded Intel IPP library and once with the multi-threaded one.

I tried different array sizes, and in all of them the multi-threaded version yields no gain (sometimes it is even slower).

I wonder, how come multi-threading brings no gain in this task?
I know Intel IPP uses AVX, so I thought maybe the task becomes memory bound?

I tried another approach: using OpenMP manually to parallelize the work on top of the single-threaded Intel IPP implementation.
This is the code:

#include "ippi.h"
#include "ippcore.h"
#include "ipps.h"
#include "ippcv.h"
#include "ippcc.h"
#include "ippvm.h"

#include <ctime>
#include <iostream>

using namespace std;

#include <omp.h>

const int height = 5000;
const int width  = 5000;
Ipp32f mInput_image [1 * width * height];
Ipp32f mOutput_image[1 * width * height] = {0};

int main()
{
    IppiSize size = {width, height};

    double start = clock();

    IppiSize blockSize = {width, height / 4};

    const int NUM_BLOCK = 4;
    omp_set_num_threads(NUM_BLOCK);

    Ipp32f*  in;
    Ipp32f*  out;

    //  ippiMul_32f_C1R(mInput_image, width * 4, mInput_image, width * 4, mOutput_image, width * 4, size);

    #pragma omp parallel            \
    shared(mInput_image, mOutput_image, blockSize) \
    private(in, out)
    {
        int id   = omp_get_thread_num();
        int step = blockSize.width * blockSize.height * id;
        in       = mInput_image  + step;
        out      = mOutput_image + step;
        ippiMul_32f_C1R(in, width * 4, in, width * 4, out, width * 4, blockSize);
    }

    double end = clock();
    double duration = (end - start) / static_cast<double>(CLOCKS_PER_SEC);

    cout << duration << endl;
    cin.get();

    return 0;
}

The results were the same: again, no performance gain.

Is there a way to benefit from multi-threading in this kind of task?
How can I verify whether a task is memory bound, so that parallelizing it brings no benefit? Is there any benefit in parallelizing the multiplication of two arrays on a CPU with AVX?

The computer I tried it on is based on a Core i7-4770K (Haswell).

Here is the project in Visual Studio 2013.

Thank you.

Recommended answer

Your images occupy 200 MB in total (2 x 5000 x 5000 x 4 bytes). Each block therefore consists of 50 MB of data. This is more than 6 times the size of your CPU's L3 cache (see here). Each AVX vector multiplication operates on 256 bits of data per operand, which is half a cache line, i.e. it consumes one cache line per vector instruction (half a cache line for each argument). A vectorised multiplication on Haswell has a latency of 5 cycles and the FPU can retire two such instructions per cycle (see here). The memory bus of the i7-4770K is rated at 25.6 GB/s (theoretical maximum!), or no more than 430 million cache lines per second. The nominal speed of the CPU is 3.5 GHz. The AVX part is clocked a bit lower, let's say at 3.1 GHz. At that speed, the AVX engine would need an order of magnitude more cache lines per second than the memory bus can deliver in order to be fully fed.
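The arithmetic behind that estimate can be written out as a small sketch. The constant and function names are hypothetical, and the inputs are the nominal figures quoted above, not measurements. It assumes decimal gigabytes, which gives 400 million cache lines per second; the 430 million quoted above corresponds to reading 25.6 GB/s in binary units, and the conclusion is the same either way.

```cpp
// Nominal figures quoted in the answer; none of them are measurements.
constexpr double kCacheLineBytes  = 64.0;    // cache line size on Haswell
constexpr double kPeakBytesPerSec = 25.6e9;  // theoretical peak, decimal GB/s (assumption)
constexpr double kAvxClockHz      = 3.1e9;   // AVX clock guessed in the answer
constexpr double kVecMulsPerCycle = 2.0;     // vector multiplies completed per cycle

// Cache lines per second the memory bus can deliver at its theoretical peak.
constexpr double supplied_lines_per_sec() {
    return kPeakBytesPerSec / kCacheLineBytes;
}

// Cache lines per second the AVX units would consume at full rate:
// each 256-bit multiply reads two 32-byte operands, i.e. one 64-byte line.
constexpr double demanded_lines_per_sec() {
    return kAvxClockHz * kVecMulsPerCycle;
}

// How many times more data the AVX units could consume than the bus supplies.
constexpr double demand_to_supply_ratio() {
    return demanded_lines_per_sec() / supplied_lines_per_sec();
}

// The "order of magnitude" claim, checked at compile time.
static_assert(demand_to_supply_ratio() > 10.0,
              "AVX demand exceeds memory supply by more than 10x");
```

With these inputs the demand is 6.2 billion cache lines per second against a supply of 0.4 billion, a ratio of 15.5. The estimate ignores the traffic needed to write the result, which only makes the imbalance worse.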

Under those conditions, a single thread of vectorised code almost fully saturates the memory bus of your CPU. Adding a second thread might result in a very slight improvement. Adding further threads only results in contention and added overhead. The only way to speed up such a calculation is to increase the memory bandwidth:

  • run on a NUMA system with more memory controllers and therefore higher aggregate memory bandwidth, e.g. a multi-socket server board;
  • switch to a different architecture with much higher memory bandwidth, e.g. Intel Xeon Phi or a GPGPU.
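As for verifying that the task is memory bound, one possible check is to measure the bandwidth the multiplication actually achieves and compare it with the 25.6 GB/s theoretical peak. The following is a minimal sketch in plain C++ without IPP; the function name `measure_multiply_bandwidth_gbs` is hypothetical, and the byte count assumes two reads and one write per element.

```cpp
#include <chrono>
#include <cstddef>
#include <utility>
#include <vector>

// Runs `repeats` passes of an element-wise multiply over n floats and
// returns the effective memory traffic in GB/s, counting two reads and one
// write per element. Real traffic can be higher, because a normal store
// usually reads the destination cache line before overwriting it.
double measure_multiply_bandwidth_gbs(std::size_t n, int repeats) {
    if (n == 0 || repeats <= 0) return 0.0;

    // b is all ones, so the values stay bounded across passes.
    std::vector<float> a(n, 1.5f), b(n, 1.0f), out(n, 0.0f);

    const auto t0 = std::chrono::steady_clock::now();
    for (int r = 0; r < repeats; ++r) {
        for (std::size_t i = 0; i < n; ++i) {
            out[i] = a[i] * b[i];
        }
        // Feed the result into the next pass so that the passes depend on
        // each other and cannot be collapsed into one by the optimizer.
        std::swap(a, out);
    }
    const auto t1 = std::chrono::steady_clock::now();

    // Read one result so the work cannot be discarded as unused.
    volatile float sink = a[n / 2];
    (void)sink;

    const double seconds = std::chrono::duration<double>(t1 - t0).count();
    if (seconds <= 0.0) return 0.0;

    const double bytes = 3.0 * sizeof(float) * static_cast<double>(n) * repeats;
    return bytes / seconds / 1e9;
}
```

Calling it with arrays far larger than the L3 cache, for example `measure_multiply_bandwidth_gbs(5000 * 5000, 20)`, gives a single-thread figure. If that figure is already a large fraction of the theoretical peak, extra threads have little bandwidth left to use, which matches the behaviour described in the question.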

