This article covers how to access the result of a previous calculation in a custom function passed to apply().

Problem description

I'm working with Pandas in Python, and I would like to access the result of the previous calculation when applying a custom function to a Series.

Roughly like this:

import pandas

# How can I obtain previous_result?
def foo(value, previous_result = None):

    # On the first iteration there is no previous result
    if previous_result is None:
        previous_result = value

    return value + previous_result

series = pandas.Series([1,2,3])
print(series.apply(foo))

This can also be generalized to "How do I pass the n previous results to the function?". I know about series.rolling(), but even with rolling I was only able to obtain the previous values of the input series, not the previous results.

Recommended answer

The most specialized forms of the operation you describe are available as cummax, cummin, cumprod and cumsum (for cumsum, f(x) = x + f(x-1)).

More functionality can be found in expanding objects: mean, standard deviation, variance, kurtosis, skewness, correlation, etc.
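As a quick illustration of the built-in methods mentioned above, here is a minimal sketch on a small made-up Series (the sample values are an assumption, chosen only for the demo):

```python
import pandas as pd

# Made-up sample data, used only for illustration
ser = pd.Series([3, 1, 4, 1, 5])

# Cumulative methods: each result depends on the previous result
running_sum = ser.cumsum()
running_max = ser.cummax()
running_prod = ser.cumprod()

# Expanding object: statistics over all values seen so far
running_mean = ser.expanding().mean()

print(running_sum.tolist())   # [3, 4, 8, 9, 14]
print(running_max.tolist())   # [3, 3, 4, 4, 5]
print(running_prod.tolist())  # [3, 3, 12, 12, 60]
print(running_mean.tolist())
```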

And for the most general case, you can use expanding().apply() with a custom function. For example,

from functools import reduce  # For Python 3.x
ser.expanding().apply(lambda r: reduce(lambda prev, value: prev + 2*value, r))

is equivalent to f(x) = 2x + f(x-1), with the first element of the Series used as the starting value.
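The same recursion can also be written so that each step really does receive the previous result. The sketch below swaps in itertools.accumulate from the standard library, which is not part of the original answer, and runs it on a small made-up Series; it should produce the same values as the expanding().apply() call above while visiting each element only once.

```python
from functools import reduce
from itertools import accumulate

import pandas as pd

# Made-up sample data, used only for illustration
ser = pd.Series([1.0, 2.0, 3.0, 4.0])

# accumulate passes the previous result as `prev` at every step
linear = pd.Series(
    list(accumulate(ser, lambda prev, value: prev + 2 * value)),
    index=ser.index,
)

# The expanding().apply() formulation, for comparison
quadratic = ser.expanding().apply(
    lambda r: reduce(lambda prev, value: prev + 2 * value, r)
)

print(linear.tolist())     # [1.0, 5.0, 11.0, 19.0]
print(quadratic.tolist())  # [1.0, 5.0, 11.0, 19.0]
```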

The methods I listed are optimized and run quite fast, but performance degrades when you use a custom function. For exponential smoothing, pandas starts to outperform loops for Series of length 1000, but the performance of expanding().apply() with reduce is quite bad:

import numpy as np
import pandas as pd

np.random.seed(0)
ser = pd.Series(70 + 5*np.random.randn(10**4))
ser.tail()
Out: 
9995    60.953592
9996    70.211794
9997    72.584361
9998    69.835397
9999    76.490557
dtype: float64


ser.ewm(alpha=0.1, adjust=False).mean().tail()
Out: 
9995    69.871614
9996    69.905632
9997    70.173505
9998    70.139694
9999    70.774781
dtype: float64

%timeit ser.ewm(alpha=0.1, adjust=False).mean()
1000 loops, best of 3: 779 µs per loop

With a loop:

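def exp_smoothing(ser, alpha=0.1):
    # Seed the recursion with the first observation (positional access)
    prev = ser.iloc[0]
    res = [prev]
    for cur in ser.iloc[1:]:
        # Each step needs only the current value and the previous result
        prev = alpha*cur + (1-alpha)*prev
        res.append(prev)
    return pd.Series(res, index=ser.index)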

exp_smoothing(ser).tail()
Out: 
9995    69.871614
9996    69.905632
9997    70.173505
9998    70.139694
9999    70.774781
dtype: float64

%timeit exp_smoothing(ser)
100 loops, best of 3: 3.54 ms per loop

The total time is still in milliseconds, but with expanding().apply():

ser.expanding().apply(lambda r: reduce(lambda p, v: 0.9*p+0.1*v, r)).tail()
Out: 
9995    69.871614
9996    69.905632
9997    70.173505
9998    70.139694
9999    70.774781
dtype: float64

%timeit ser.expanding().apply(lambda r: reduce(lambda p, v: 0.9*p+0.1*v, r))
1 loop, best of 3: 13 s per loop

Methods like cummin and cumsum are optimized and only require the current value of x and the previous value of the function. With a custom function passed to expanding().apply(), however, the complexity is O(n**2), because the function is re-evaluated over the whole prefix for every element. This is mainly because there are cases where the previous value of the function and the current value of x are not enough to calculate the current value of the function. For cumsum, you can take the previous cumsum and add the current value to get the result. You cannot do that for, say, the geometric mean, which cannot be updated from its previous value and the current x alone. That's why expanding becomes unusable even for moderately sized Series.
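To make the quadratic cost concrete, the following sketch counts how often the smoothing step is evaluated in each formulation. The counters, the helper names step_expanding and step_accumulate, and the sample size are assumptions introduced for the demo; itertools.accumulate stands in for any single-pass approach. For n values, the expanding().apply() version is expected to call the step on the order of n**2 / 2 times, while the single-pass version calls it n - 1 times.

```python
from functools import reduce
from itertools import accumulate

import pandas as pd

# Hypothetical call counters, only for this demonstration
calls = {"expanding": 0, "accumulate": 0}

def step_expanding(prev, value):
    calls["expanding"] += 1
    return 0.9 * prev + 0.1 * value

def step_accumulate(prev, value):
    calls["accumulate"] += 1
    return 0.9 * prev + 0.1 * value

n = 200
ser = pd.Series(range(n), dtype="float64")

# Re-runs the whole recursion on every prefix of the Series
slow = ser.expanding().apply(lambda r: reduce(step_expanding, r))

# Visits each element once, carrying the previous result forward
fast = pd.Series(list(accumulate(ser, step_accumulate)), index=ser.index)

print(calls)
```

Both versions should give the same numbers; only the amount of repeated work differs.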

In general, iterating over a Series is not a very expensive operation. For DataFrames, iteration has to return a copy of each row, so it is very inefficient, but this is not the case for Series. Of course you should use vectorized methods when they are available, but when they are not, using a for loop for a task like a recursive calculation is OK.
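For the generalized question of passing the n previous results to the function, a plain loop can keep a short history. This is only a sketch: apply_with_history is a hypothetical helper, not a pandas API, and the signature func(value, history) is an assumption made for the example.

```python
from collections import deque

import pandas as pd

def apply_with_history(ser, func, n=1):
    """Hypothetical helper: call func(value, history) for each value, where
    history is a tuple of up to n most recent results (oldest first)."""
    history = deque(maxlen=n)  # automatically drops results older than n steps
    results = []
    for value in ser:
        result = func(value, tuple(history))
        history.append(result)
        results.append(result)
    return pd.Series(results, index=ser.index)

# The function from the question, adapted to the assumed signature
def foo(value, history):
    # On the first iteration there is no previous result
    previous_result = history[-1] if history else value
    return value + previous_result

series = pd.Series([1, 2, 3])
print(apply_with_history(series, foo).tolist())  # [2, 4, 7]

# With n=2, each result can use the two previous results
two_back = apply_with_history(series, lambda v, h: v + sum(h), n=2)
print(two_back.tolist())  # [1, 3, 7]
```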

