This article looks at why looping over slices of a NumPy array can be faster than a fully vectorized operation over the whole array, which may be a useful reference for anyone running into the same problem.

Problem description

I need to create a boolean mask by thresholding a 3D data array: the mask must be set to True at locations where the data are smaller than the lower acceptable limit or larger than the upper acceptable limit (and False otherwise). Succinctly:

mask = (data < low) or (data > high)
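Note that the line above is shorthand: applied to NumPy arrays, Python's `or` raises a ValueError, so the elementwise form uses the `|` operator (or np.logical_or). A minimal sketch, reusing the question's names:

import numpy as np

data = np.random.random((10, 1000, 1000))
low, high = 0.2, 0.8
# `|` is the elementwise OR for arrays; plain `or` would raise a ValueError
mask = (data < low) | (data > high)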

I have two versions of the code for performing this operation: one works directly with the entire 3D array in numpy, while the other loops over slices of the array. Contrary to my expectations, the second method seems to be faster than the first one. Why?

In [1]: import numpy as np

In [2]: import sys

In [3]: print(sys.version)
3.6.2 |Continuum Analytics, Inc.| (default, Jul 20 2017, 13:14:59) 
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)]

In [4]: print(np.__version__)
1.14.0

In [5]: arr = np.random.random((10, 1000, 1000))

In [6]: def method1(arr, low, high):
   ...:     """ Fully vectorized computations """
   ...:     out = np.empty(arr.shape, dtype=np.bool)
   ...:     np.greater_equal(arr, high, out)
   ...:     np.logical_or(out, arr < low, out)
   ...:     return out
   ...: 

In [7]: def method2(arr, low, high):
   ...:     """ Partially vectorized computations """
   ...:     out = np.empty(arr.shape, dtype=np.bool)
   ...:     for k in range(arr.shape[0]):
   ...:         a = arr[k]
   ...:         o = out[k]
   ...:         np.greater_equal(a, high, o)
   ...:         np.logical_or(o, a < low, o)
   ...:     return out
   ...: 

First of all, let's make sure that both methods produce identical results:

In [8]: np.all(method1(arr, 0.2, 0.8) == method2(arr, 0.2, 0.8))
Out[8]: True

And now some timing tests:

In [9]: %timeit method1(arr, 0.2, 0.8)
14.4 ms ± 111 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [10]: %timeit method2(arr, 0.2, 0.8)
11.5 ms ± 241 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

What is going on here?

EDIT 1: A similar behavior is observed in an older environment:

In [3]: print(sys.version)
2.7.13 |Continuum Analytics, Inc.| (default, Dec 20 2016, 23:05:08) 
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)]

In [4]: print(np.__version__)
1.11.3

In [9]: %timeit method1(arr, 0.2, 0.8)
100 loops, best of 3: 14.3 ms per loop

In [10]: %timeit method2(arr, 0.2, 0.8)
100 loops, best of 3: 13 ms per loop

Recommended answer

Both methods are outperformed by a compiled version of this operation, shown below.

In method one you are accessing the array twice. If it doesn't fit in the cache, the data will be read from RAM twice, which lowers performance. Additionally, it is possible that temporary arrays are created, as mentioned in the comments.
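To make the temporaries concrete, here is a rough sketch (the threshold values are just examples): a one-line expression allocates a fresh boolean array for each comparison and for the OR, while method1 funnels the results into a preallocated output through the ufuncs' out argument.

# One-liner: three fresh ~10 MB boolean allocations (two comparisons + the OR)
naive = (arr < 0.2) | (arr > 0.8)

# method1's approach: results are written into the preallocated `out`,
# so only `arr < 0.2` creates a temporary
out = np.empty(arr.shape, dtype=bool)
np.greater_equal(arr, 0.8, out)
np.logical_or(out, arr < 0.2, out)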

Method two is more cache friendly, since you are accessing a smaller part of the array twice, which is likely to fit in the cache. The downsides are the slow Python-level loop and the larger number of function calls, which are also quite slow.
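A quick back-of-envelope check (a sketch, assuming the 10×1000×1000 float64 array from the question) shows why: the full array is 80 MB, well beyond a typical L3 cache, while one slice is only 8 MB and may fit.

import numpy as np

arr = np.random.random((10, 1000, 1000))
print(arr.nbytes / 1e6)     # 80.0 -> full float64 array, larger than typical L3 caches
print(arr[0].nbytes / 1e6)  # 8.0  -> a single slice, small enough for many L3 caches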

To get good performance here it is recommended to compile the code, which can be done using Cython or Numba. Since the Cython version is somewhat more work (annotating, the need for a separate compiler), I will show how to do this using Numba.

import numpy as np
import numba as nb

@nb.njit(fastmath=True, cache=True)
def method3(arr, low, high):
    out = np.empty(arr.shape, dtype=nb.boolean)
    for i in range(arr.shape[0]):
        for j in range(arr.shape[1]):
            for k in range(arr.shape[2]):
                # single pass: each element is read once and written once
                out[i, j, k] = arr[i, j, k] < low or arr[i, j, k] > high
    return out

Using arr = np.random.random((10, 1000, 1000)), this outperforms your method1 by a factor of two and your method2 by 50 percent on my PC (Core i7-4771, Python 3.5, Windows).
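A quick sanity check (assuming arr, method1 and method3 as defined above); note that the first call to an njit-compiled function includes compilation time, so it should be excluded from timings:

res3 = method3(arr, 0.2, 0.8)                        # first call triggers JIT compilation
assert np.array_equal(res3, method1(arr, 0.2, 0.8))  # same mask as the vectorized version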

This is only a simple example; on more complex code, where you can make use of SIMD and parallel processing (which is also very easy to use), the performance gain can be a lot bigger. In non-compiled code, vectorization is usually, but not always (as shown), the best you can do; however, it always leads to bad cache behavior, which can cause suboptimal performance if the chunks of data you are accessing don't fit at least in the L3 cache. On some other problems there will also be a performance hit if the data can't fit in the much smaller L1 or L2 cache. Another advantage is the automatic inlining of small njit-compiled functions into an njit-compiled function that calls them.
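As a hedged sketch of the parallel variant mentioned above (not part of the original answer): with parallel=True, Numba's nb.prange distributes the outer loop across threads.

import numpy as np
import numba as nb

@nb.njit(fastmath=True, parallel=True)
def method3_parallel(arr, low, high):
    out = np.empty(arr.shape, dtype=nb.boolean)
    # prange splits the iterations of the outer loop across threads
    for i in nb.prange(arr.shape[0]):
        for j in range(arr.shape[1]):
            for k in range(arr.shape[2]):
                out[i, j, k] = arr[i, j, k] < low or arr[i, j, k] > high
    return out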

That concludes this look at why looping over slices of a numpy array can be faster than fully vectorized operations; hopefully the recommended answer is helpful.
