本文介绍了使用没有分组的分类键对数据帧中的值进行排序拆分应用组合的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

所以...我有一个看起来像这样的数据框,但要大得多:

So... I have a Dataframe that looks like this, but much larger:

    DATE        ITEM    STORE   STOCK
0   2018-06-06     A    L001    4
1   2018-06-06     A    L002    0
2   2018-06-06     A    L003    4
3   2018-06-06     B    L001    1
4   2018-06-06     B    L002    2

您可以使用以下代码重现相同的 DataFrame:

You can reproduce the same DataFrame with the following code:

import pandas as pd
import numpy as np
import itertools as it

lojas = ['L001', 'L002', 'L003']
itens = list("ABC")
dr = pd.date_range(start='2018-06-06', end='2018-06-12')

df = pd.DataFrame(data=list(it.product(dr, itens, lojas)), columns=['DATE', 'ITEM', 'STORE'])
df['STOCK'] = np.random.randint(0,5, size=len(df.ITEM))

我想计算每对 ITEM-STORE 中的天数之间的 de STOCK 差异,并且使用 .diff() 函数可以很容易地迭代 groupby 对象中的组以获得如下结果:

I wanna calculate de STOCK difference between days in every pair ITEM-STORE and iterating over groups in a groupby object is easy using the function .diff() to get something like this:

    DATE       ITEM     STORE   STOCK   DELTA
0   2018-06-06    A     L001    4        NaN
9   2018-06-07    A     L001    0       -4.0
18  2018-06-08    A     L001    4        4.0
27  2018-06-09    A     L001    0       -4.0
36  2018-06-10    A     L001    3        3.0
45  2018-06-11    A     L001    2       -1.0
54  2018-06-12    A     L001    2        0.0

我已经通过以下代码设法做到了:

I´ve manage to do so by the following code:

gg = df.groupby([df.ITEM, df.STORE])
lg = []

for (name, group) in gg:
    aux = group.copy()
    aux.reset_index(drop=True, inplace=True)
    aux['DELTA'] = aux.STOCK.diff().fillna(value=0, inplace=Tr

    lg.append(aux)

df = pd.concat(lg) 

但是在大​​型 DataFrame 中,它变得不切实际.有没有更快更 Pythonic 的方式来完成这项任务?

But in a large DataFrame, it gets impracticable. Is there a faster more pythonic way to do this task?

推荐答案

我已经尝试改进您的 groupby 代码,所以这应该会快很多.

I've tried to improve your groupby code, so this should be a lot faster.

v = df.groupby(['ITEM', 'STORE'], sort=False).STOCK.diff()
df['DELTA'] = np.where(np.isnan(v), 0, v)

这里有一些提示/想法:

Some pointers/ideas here:

  1. 不要遍历组
  2. 如果系列属于同一个 DataFrame,则不要将系列作为石斑鱼传递.改为传递字符串标签.
  3. diff 可以矢量化
  4. 最后一行相当于一个fillna,但是fillnanp.where
  5. 指定 sort=False 将阻止输出按 grouper 键排序,从而进一步提高性能
  1. Don't iterate over groups
  2. Don't pass series as the groupers if the series belong to the same DataFrame. Pass string labels instead.
  3. diff can be vectorized
  4. The last line is tantamount to a fillna, but fillna is slower than np.where
  5. Specifying sort=False will prevent the output from being sorted by grouper keys, improving performance further

这也可以改写为

df['DELTA'] = df.groupby(['ITEM', 'STORE'], sort=False).STOCK.diff().fillna(0)

这篇关于使用没有分组的分类键对数据帧中的值进行排序拆分应用组合的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!

09-27 16:52