问题描述
所以...我有一个看起来像这样的数据框,但要大得多:
So... I have a Dataframe that looks like this, but much larger:
DATE ITEM STORE STOCK
0 2018-06-06 A L001 4
1 2018-06-06 A L002 0
2 2018-06-06 A L003 4
3 2018-06-06 B L001 1
4 2018-06-06 B L002 2
您可以使用以下代码重现相同的 DataFrame
:
You can reproduce the same DataFrame
with the following code:
import pandas as pd
import numpy as np
import itertools as it
lojas = ['L001', 'L002', 'L003']
itens = list("ABC")
dr = pd.date_range(start='2018-06-06', end='2018-06-12')
df = pd.DataFrame(data=list(it.product(dr, itens, lojas)), columns=['DATE', 'ITEM', 'STORE'])
df['STOCK'] = np.random.randint(0,5, size=len(df.ITEM))
我想计算每对 ITEM-STORE 中的天数之间的 de STOCK 差异,并且使用 .diff()
函数可以很容易地迭代 groupby 对象中的组以获得如下结果:
I wanna calculate de STOCK difference between days in every pair ITEM-STORE and iterating over groups in a groupby object is easy using the function .diff()
to get something like this:
DATE ITEM STORE STOCK DELTA
0 2018-06-06 A L001 4 NaN
9 2018-06-07 A L001 0 -4.0
18 2018-06-08 A L001 4 4.0
27 2018-06-09 A L001 0 -4.0
36 2018-06-10 A L001 3 3.0
45 2018-06-11 A L001 2 -1.0
54 2018-06-12 A L001 2 0.0
我已经通过以下代码设法做到了:
I´ve manage to do so by the following code:
gg = df.groupby([df.ITEM, df.STORE])
lg = []
for (name, group) in gg:
aux = group.copy()
aux.reset_index(drop=True, inplace=True)
aux['DELTA'] = aux.STOCK.diff().fillna(value=0, inplace=Tr
lg.append(aux)
df = pd.concat(lg)
但是在大型 DataFrame 中,它变得不切实际.有没有更快更 Pythonic 的方式来完成这项任务?
But in a large DataFrame, it gets impracticable. Is there a faster more pythonic way to do this task?
推荐答案
我已经尝试改进您的 groupby 代码,所以这应该会快很多.
I've tried to improve your groupby code, so this should be a lot faster.
v = df.groupby(['ITEM', 'STORE'], sort=False).STOCK.diff()
df['DELTA'] = np.where(np.isnan(v), 0, v)
这里有一些提示/想法:
Some pointers/ideas here:
- 不要遍历组
- 如果系列属于同一个 DataFrame,则不要将系列作为石斑鱼传递.改为传递字符串标签.
diff
可以矢量化- 最后一行相当于一个
fillna
,但是fillna
比np.where
慢 - 指定
sort=False
将阻止输出按 grouper 键排序,从而进一步提高性能
- Don't iterate over groups
- Don't pass series as the groupers if the series belong to the same DataFrame. Pass string labels instead.
diff
can be vectorized- The last line is tantamount to a
fillna
, butfillna
is slower thannp.where
- Specifying
sort=False
will prevent the output from being sorted by grouper keys, improving performance further
这也可以改写为
df['DELTA'] = df.groupby(['ITEM', 'STORE'], sort=False).STOCK.diff().fillna(0)
这篇关于使用没有分组的分类键对数据帧中的值进行排序拆分应用组合的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!