This article shows how to calculate the average time difference between consecutive items grouped by a specific column.

Problem description

I have the following dataframe:

userid | time     
1        22.01.2001 13:00
1        22.01.2001 13:05   
1        22.01.2001 13:07  
2        22.01.2001 14:00
2        22.01.2001 14:04   
2        22.01.2001 13:05  
2        22.01.2001 13:06  
3        22.01.2001 13:20  
3        22.01.2001 13:22  
4        22.01.2001 13:37  

What I want to obtain is a new column per user that stores the average time difference among the consecutive activities:

userid | avg_time_diff
1        3.5    #(5 + 2) / 2
2        2      #(4 + 1 + 1) / 3
3        2
4        0

To achieve this, do I need to loop through each user and calculate the average time difference one by one? Or is there a quicker way to achieve the same result?

Solution

Consider the following approach:

In [84]: df.sort_values('time').groupby('userid')['time'] \
           .apply(lambda x: x.diff().dt.seconds.mean()/60)
Out[84]:
userid
1     3.500000
2    19.666667
3     2.000000
4          NaN
Name: time, dtype: float64

Some explanations:

First, we sort the DataFrame by the time column; otherwise some of the differences could come out negative.

Then we group by userid and, for each group, calculate the time difference between consecutive (sorted) rows. This produces a Series of timedelta64[ns] dtype, which has a .dt.seconds accessor.
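One caveat about the accessor is worth illustrating: .dt.seconds returns only the seconds component of each timedelta (the part left over after whole days), while .dt.total_seconds() returns the full duration. The sample data stays within a single day, so both give the same result there. The two values below are made up for illustration and are not part of the question's data.

```python
import pandas as pd

# Two hypothetical gaps: 5 minutes, and 1 day plus 5 minutes.
gaps = pd.Series(pd.to_timedelta(["5min", "1 days 00:05:00"]))

# .dt.seconds drops the days part, so both gaps look identical.
seconds_component = gaps.dt.seconds.tolist()

# .dt.total_seconds() keeps the whole duration.
total_seconds = gaps.dt.total_seconds().tolist()

print(seconds_component)  # [300, 300]
print(total_seconds)      # [300.0, 86700.0]
```

If activities for one user can be more than a day apart, .dt.total_seconds() is probably the safer choice.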

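Using .dt.seconds.mean() we can calculate the average for each group, and dividing by 60 converts it to minutes. A user with a single row, such as userid 4, has no differences to average, so the result is NaN rather than 0.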
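For readers who want to reproduce the result from scratch, here is a minimal, self-contained sketch. It assumes the time column arrives as strings in day.month.year format and must be parsed first; the variable names (ordered, gaps_min, avg_time_diff) are illustrative. It also uses the group-wise diff() instead of apply with a lambda, which is a variation on the approach above rather than the answer's exact code.

```python
import pandas as pd

# Sample data from the question, with time still stored as text.
df = pd.DataFrame({
    "userid": [1, 1, 1, 2, 2, 2, 2, 3, 3, 4],
    "time": [
        "22.01.2001 13:00", "22.01.2001 13:05", "22.01.2001 13:07",
        "22.01.2001 14:00", "22.01.2001 14:04", "22.01.2001 13:05",
        "22.01.2001 13:06", "22.01.2001 13:20", "22.01.2001 13:22",
        "22.01.2001 13:37",
    ],
})

# Parse the strings into datetimes so that diff() yields timedeltas.
df["time"] = pd.to_datetime(df["time"], format="%d.%m.%Y %H:%M")

# Sort within each user, then take differences between consecutive rows.
ordered = df.sort_values(["userid", "time"])
gaps_min = ordered.groupby("userid")["time"].diff().dt.total_seconds() / 60

# Average the gaps per user; NaN (the first row of each user) is skipped.
avg_time_diff = gaps_min.groupby(ordered["userid"]).mean()
print(avg_time_diff)
```

If 0 is preferred for users with a single activity, as in the desired output for userid 4, avg_time_diff.fillna(0) would do that.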

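UPDATE: to exclude gaps of threshold minutes or more (60 here), filter the differences before taking the mean. In this sample every gap is shorter than 60 minutes, so the output is unchanged: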

In [122]: threshold = 60
     ...:
     ...: (df.sort_values('time').groupby('userid')['time']
     ...:    .apply(lambda x: (x.diff().dt.seconds/60)
     ...:                     .to_frame('diff')
     ...:                     .query("diff < @threshold")['diff'].mean()))
     ...:
Out[122]:
userid
1     3.500000
2    19.666667
3     2.000000
4          NaN
Name: time, dtype: float64
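To see the threshold actually change a result, the sketch below wraps the same idea in a helper and applies a smaller cutoff. The function name mean_gap_minutes and the 10-minute cutoff are assumptions made for illustration; the filter uses Series.where rather than query, which should be equivalent here because gaps that fail the test become NaN and are then skipped by mean().

```python
import pandas as pd

def mean_gap_minutes(df, threshold=None):
    """Average gap in minutes between consecutive rows per userid.

    Gaps of `threshold` minutes or more are ignored when a threshold is given.
    """
    ordered = df.sort_values(["userid", "time"])
    gaps = ordered.groupby("userid")["time"].diff().dt.total_seconds() / 60
    if threshold is not None:
        # Keep only gaps below the threshold; the rest become NaN.
        gaps = gaps.where(gaps < threshold)
    return gaps.groupby(ordered["userid"]).mean()

# Sample data from the question.
df = pd.DataFrame({
    "userid": [1, 1, 1, 2, 2, 2, 2, 3, 3, 4],
    "time": pd.to_datetime([
        "22.01.2001 13:00", "22.01.2001 13:05", "22.01.2001 13:07",
        "22.01.2001 14:00", "22.01.2001 14:04", "22.01.2001 13:05",
        "22.01.2001 13:06", "22.01.2001 13:20", "22.01.2001 13:22",
        "22.01.2001 13:37",
    ], format="%d.%m.%Y %H:%M"),
})

with_60 = mean_gap_minutes(df, threshold=60)  # same as the output above
with_10 = mean_gap_minutes(df, threshold=10)  # drops the 54-minute gap of user 2
print(with_10)
```

With a 10-minute cutoff, the 54-minute gap between 13:06 and 14:00 for userid 2 is dropped, so that user's average becomes (1 + 4) / 2 = 2.5, while the other users are unaffected.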

