What performance impact does using int64_t instead of int32_t have on 32-bit systems?

Question

Our C++ library currently uses time_t for storing time values. I'm beginning to need sub-second precision in some places, so a larger data type will be necessary there anyway. Also, it might be useful to get around the Year 2038 problem in some places. So I'm thinking about switching completely to a single Time class with an underlying int64_t value, to replace the time_t value everywhere.

Now I'm wondering about the performance impact of such a change when running this code on a 32-bit operating system or 32-bit CPU. As I understand it, the compiler will generate code to perform 64-bit arithmetic using 32-bit registers. But if this is too slow, I might have to handle time values in a more differentiated way, which might make the software more difficult to maintain.

What I'm interested in:

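  • Which factors influence the performance of these operations? Probably the compiler and compiler version; but does the operating system or the CPU make/model influence this as well? Will a normal 32-bit system use the 64-bit registers of modern CPUs?
  • Which operations will be especially slow when emulated on 32-bit? Or which will have nearly no slowdown?
  • Are there any existing benchmark results for using int64_t/uint64_t on 32-bit systems?
  • Does anyone have their own experience with this performance impact?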

I'm mostly interested in g++ 4.1 and 4.4 on Linux 2.6 (RHEL5, RHEL6) on Intel Core 2 systems, but it would also be nice to know about the situation for other systems (like Sparc Solaris + Solaris CC, Windows + MSVC).

Recommended answer

Mostly the processor architecture (and model; wherever I mention processor architecture in this section, read it as including the model). The compiler may have some influence, but most compilers do pretty well on this, so the processor architecture will have a bigger influence than the compiler.

The operating system will have no influence whatsoever (other than "if you change OS, you need to use a different type of compiler, which changes what the compiler does" in some cases, but that's probably a small effect).

Will a normal 32-bit system use the 64-bit registers of modern CPUs?

This is not possible. If the system is in 32-bit mode, it will act as a 32-bit system; the extra 32 bits of the registers are completely invisible, just as they would be if the system were actually a "true 32-bit system".

Which operations will be especially slow when emulated on 32-bit? Or which will have nearly no slowdown?

Addition and subtraction are worse, as each has to be done as a sequence of two operations, and the second operation requires the first to have completed. That is not the case when the compiler is just producing two add operations on independent data.

Multiplication will get a lot worse if the input parameters are actually 64 bits wide, so 2^35 * 83 is worse than 2^31 * 2^31, for example. This is because the processor can do a 32 x 32 bit multiply into a 64-bit result pretty well, in some 5-10 clock cycles. But a 64 x 64 bit multiply requires a fair bit of extra code, so it will take longer.

Division is a similar problem to multiplication, but here it's OK to take a 64-bit input on one side, divide it by a 32-bit value and get a 32-bit value out. Since it's hard to predict when this will work, the 64-bit divide is probably nearly always slow.

The data will also take twice as much cache space, which may impact the results. As a similar consequence, general assignment and passing data around will take at least twice as long, since there is twice as much data to operate on.

The compiler will also need to use more registers.

Are there any existing benchmark results for using int64_t/uint64_t on 32-bit systems?

Probably, but I'm not aware of any. And even if there are, they would only be somewhat meaningful to you, since the mix of operations is HIGHLY critical to the speed of operations.

If performance is an important part of your application, then benchmark YOUR code (or some representative part of it). It doesn't really matter whether Benchmark X gives 5%, 25% or 103% slower results if your code is slower or faster by some completely different amount under the same circumstances.

Does anyone have their own experience with this performance impact?

I've recompiled some code that uses 64-bit integers for a 64-bit architecture, and found that performance improved by a substantial amount, as much as 25% on some bits of code.

Changing your OS to a 64-bit version of the same OS would help, perhaps?

Because I like to find out what the difference is in these sorts of things, I have written a bit of code, with some primitive templates (still learning that bit; templates aren't exactly my hottest topic, I must say. Give me bit-fiddling and pointer arithmetic, and I'll (usually) get it right...)

Here's the code I wrote, trying to replicate a few common functions:

#include <iostream>
#include <cstdint>
#include <ctime>

using namespace std;

static __inline__ uint64_t rdtsc(void)
{
    unsigned hi, lo;
    __asm__ __volatile__ ("rdtsc" : "=a"(lo), "=d"(hi));
    return ( (uint64_t)lo)|( ((uint64_t)hi)<<32 );
}

template<typename T>
static T add_numbers(const T *v, const int size)
{
    T sum = 0;
    for(int i = 0; i < size; i++)
        sum += v[i];
    return sum;
}


template<typename T, const int size>
static T add_matrix(const T v[size][size])
{
    T sum[size] = {};
    for(int i = 0; i < size; i++)
    {
        for(int j = 0; j < size; j++)
            sum[i] += v[i][j];
    }
    T tsum=0;
    for(int i = 0; i < size; i++)
        tsum += sum[i];
    return tsum;
}



template<typename T>
static T add_mul_numbers(const T *v, const T mul, const int size)
{
    T sum = 0;
    for(int i = 0; i < size; i++)
        sum += v[i] * mul;
    return sum;
}

template<typename T>
static T add_div_numbers(const T *v, const T divisor, const int size)
{
    T sum = 0;
    for(int i = 0; i < size; i++)
        sum += v[i] / divisor;
    return sum;
}


template<typename T>
void fill_array(T *v, const int size)
{
    for(int i = 0; i < size; i++)
        v[i] = i;
}

template<typename T, const int size>
void fill_array(T v[size][size])
{
    for(int i = 0; i < size; i++)
        for(int j = 0; j < size; j++)
            v[i][j] = i + size * j;
}




uint32_t bench_add_numbers(const uint32_t v[], const int size)
{
    uint32_t res = add_numbers(v, size);
    return res;
}

uint64_t bench_add_numbers(const uint64_t v[], const int size)
{
    uint64_t res = add_numbers(v, size);
    return res;
}

uint32_t bench_add_mul_numbers(const uint32_t v[], const int size)
{
    const uint32_t c = 7;
    uint32_t res = add_mul_numbers(v, c, size);
    return res;
}

uint64_t bench_add_mul_numbers(const uint64_t v[], const int size)
{
    const uint64_t c = 7;
    uint64_t res = add_mul_numbers(v, c, size);
    return res;
}

uint32_t bench_add_div_numbers(const uint32_t v[], const int size)
{
    const uint32_t c = 7;
    uint32_t res = add_div_numbers(v, c, size);
    return res;
}

uint64_t bench_add_div_numbers(const uint64_t v[], const int size)
{
    const uint64_t c = 7;
    uint64_t res = add_div_numbers(v, c, size);
    return res;
}


template<const int size>
uint32_t bench_matrix(const uint32_t v[size][size])
{
    uint32_t res = add_matrix(v);
    return res;
}
template<const int size>
uint64_t bench_matrix(const uint64_t v[size][size])
{
    uint64_t res = add_matrix(v);
    return res;
}


template<typename T>
void runbench(T (*func)(const T *v, const int size), const char *name, T *v, const int size)
{
    fill_array(v, size);

    uint64_t t = rdtsc();
    T res = func(v, size);
    t = rdtsc() - t;
    cout << "result = " << res << endl;
    cout << name << " time in clocks " << dec << t  << endl;
}

template<typename T, const int size>
void runbench2(T (*func)(const T v[size][size]), const char *name, T v[size][size])
{
    fill_array(v);

    uint64_t t = rdtsc();
    T res = func(v);
    t = rdtsc() - t;
    cout << "result = " << res << endl;
    cout << name << " time in clocks " << dec << t  << endl;
}


int main()
{
    // spin up CPU to full speed...
    time_t t = time(NULL);
    while(t == time(NULL)) ;

    const int vsize=10000;

    uint32_t v32[vsize];
    uint64_t v64[vsize];

    uint32_t m32[100][100];
    uint64_t m64[100][100];


    runbench(bench_add_numbers, "Add 32", v32, vsize);
    runbench(bench_add_numbers, "Add 64", v64, vsize);

    runbench(bench_add_mul_numbers, "Add Mul 32", v32, vsize);
    runbench(bench_add_mul_numbers, "Add Mul 64", v64, vsize);

    runbench(bench_add_div_numbers, "Add Div 32", v32, vsize);
    runbench(bench_add_div_numbers, "Add Div 64", v64, vsize);

    runbench2(bench_matrix, "Matrix 32", m32);
    runbench2(bench_matrix, "Matrix 64", m64);
}

Compiled with:

g++ -Wall -m32 -O3 -o 32vs64 32vs64.cpp -std=c++0x

And the results are as follows. Note: see the 2016 results below. The results here are slightly optimistic for 64-bit, because SSE instructions are used in 64-bit mode but not in 32-bit mode.

result = 49995000
Add 32 time in clocks 20784
result = 49995000
Add 64 time in clocks 30358
result = 349965000
Add Mul 32 time in clocks 30182
result = 349965000
Add Mul 64 time in clocks 79081
result = 7137858
Add Div 32 time in clocks 60167
result = 7137858
Add Div 64 time in clocks 457116
result = 49995000
Matrix 32 time in clocks 22831
result = 49995000
Matrix 64 time in clocks 23823

As you can see, addition and multiplication aren't that much worse. Division gets really bad. Interestingly, the matrix addition shows hardly any difference at all.

"And is it faster on 64-bit?", I hear some of you ask. Using the same compiler options, just -m64 instead of -m32: yup, a lot faster:

result = 49995000
Add 32 time in clocks 8366
result = 49995000
Add 64 time in clocks 16188
result = 349965000
Add Mul 32 time in clocks 15943
result = 349965000
Add Mul 64 time in clocks 35828
result = 7137858
Add Div 32 time in clocks 50176
result = 7137858
Add Div 64 time in clocks 50472
result = 49995000
Matrix 32 time in clocks 12294
result = 49995000
Matrix 64 time in clocks 14733

Edit, update for 2016: four variants, with and without SSE, in 32- and 64-bit mode of the compiler.

I typically use clang++ as my usual compiler these days. I tried compiling with g++ (but it would still be a different version than above, as I've updated my machine, and I have a different CPU too). Since g++ failed to compile the no-sse version in 64-bit, I didn't see the point in that. (g++ gives similar results anyway.)

As a short table (times in clocks):

Test name      | no-sse 32 | no-sse 64 | sse 32 | sse 64 |
----------------------------------------------------------
Add uint32_t   |   20837   |   10221   |   3701 |   3017 |
----------------------------------------------------------
Add uint64_t   |   18633   |   11270   |   9328 |   9180 |
----------------------------------------------------------
Add Mul 32     |   26785   |   18342   |  11510 |  11562 |
----------------------------------------------------------
Add Mul 64     |   44701   |   17693   |  29213 |  16159 |
----------------------------------------------------------
Add Div 32     |   44570   |   47695   |  17713 |  17523 |
----------------------------------------------------------
Add Div 64     |  405258   |   52875   | 405150 |  47043 |
----------------------------------------------------------
Matrix 32      |   41470   |   15811   |  21542 |   8622 |
----------------------------------------------------------
Matrix 64      |   22184   |   15168   |  13757 |  12448 |

Full results with compile options:

$ clang++ -m32 -mno-sse 32vs64.cpp --std=c++11 -O2
$ ./a.out
result = 49995000
Add 32 time in clocks 20837
result = 49995000
Add 64 time in clocks 18633
result = 349965000
Add Mul 32 time in clocks 26785
result = 349965000
Add Mul 64 time in clocks 44701
result = 7137858
Add Div 32 time in clocks 44570
result = 7137858
Add Div 64 time in clocks 405258
result = 49995000
Matrix 32 time in clocks 41470
result = 49995000
Matrix 64 time in clocks 22184

$ clang++ -m32 -msse 32vs64.cpp --std=c++11 -O2
$ ./a.out
result = 49995000
Add 32 time in clocks 3701
result = 49995000
Add 64 time in clocks 9328
result = 349965000
Add Mul 32 time in clocks 11510
result = 349965000
Add Mul 64 time in clocks 29213
result = 7137858
Add Div 32 time in clocks 17713
result = 7137858
Add Div 64 time in clocks 405150
result = 49995000
Matrix 32 time in clocks 21542
result = 49995000
Matrix 64 time in clocks 13757


$ clang++ -m64 -msse 32vs64.cpp --std=c++11 -O2
$ ./a.out
result = 49995000
Add 32 time in clocks 3017
result = 49995000
Add 64 time in clocks 9180
result = 349965000
Add Mul 32 time in clocks 11562
result = 349965000
Add Mul 64 time in clocks 16159
result = 7137858
Add Div 32 time in clocks 17523
result = 7137858
Add Div 64 time in clocks 47043
result = 49995000
Matrix 32 time in clocks 8622
result = 49995000
Matrix 64 time in clocks 12448


$ clang++ -m64 -mno-sse 32vs64.cpp --std=c++11 -O2
$ ./a.out
result = 49995000
Add 32 time in clocks 10221
result = 49995000
Add 64 time in clocks 11270
result = 349965000
Add Mul 32 time in clocks 18342
result = 349965000
Add Mul 64 time in clocks 17693
result = 7137858
Add Div 32 time in clocks 47695
result = 7137858
Add Div 64 time in clocks 52875
result = 49995000
Matrix 32 time in clocks 15811
result = 49995000
Matrix 64 time in clocks 15168
