Python/pandas equivalent of dplyr 1.0.0 summarize(across())

Question

In R, I find the following very useful when dealing with many variables:

library(dplyr)
dat <- group_by(mtcars, cyl)
summarize(dat, across(c('mpg','disp'), sum), across(c('drat','wt','qsec'), mean))

Or even better, selecting with pseudo-regex:

summarize(dat, across(ends_with('p'), sum), across(ends_with('t'), mean))
# A tibble: 3 x 5
    cyl  disp    hp  drat    wt
  <dbl> <dbl> <dbl> <dbl> <dbl>
1     4 1156.   909  4.07  2.29
2     6 1283.   856  3.59  3.12
3     8 4943.  2929  3.23  4.00

In pandas, the equivalent seems to be passing variables one by one into a dictionary, e.g. from this gist:

group_agg = df.groupby("group1").agg({
  "var1" : ["mean"],
  "var2" : ["sum"],
  "var3" : ["mean"]
  })

Is there a less verbose way to do this operation in pandas, or with some other package?

Recommended answer

For the first scenario, pd.concat suffices:

import pandas as pd

# df is assumed to hold the mtcars data
dat = df.groupby("cyl")

pd.concat([dat[["mpg", "disp"]].sum(), dat[["drat", "wt", "qsec"]].mean()], axis=1)

For the regex/string-matching part, some verbosity is unavoidable:

cols_p = [col for col in df.columns if col.endswith("p")]
cols_t = [col for col in df.columns if col.endswith("t")]

pd.concat((dat[cols_p].sum(), dat[cols_t].mean()), axis=1)

It would be cool, though, if you could write a function that encapsulates across, particularly for the regex case; that's a lovely trick.
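A minimal sketch of what such a helper might look like. The function name `across` is hypothetical (it is not part of pandas), and the frame below is made-up data that only borrows a few mtcars column names. The idea is to expand each regex into a column-to-function dictionary and hand the merged dictionary to `agg`:

```python
import re

import pandas as pd


def across(columns, pattern, func):
    """Hypothetical helper: map every column matching `pattern` to `func`."""
    return {col: func for col in columns if re.search(pattern, col)}


# Made-up data; only the column names echo mtcars.
df = pd.DataFrame({
    "cyl": [4, 4, 6, 6],
    "disp": [100.0, 120.0, 160.0, 200.0],
    "hp": [90.0, 95.0, 110.0, 120.0],
    "drat": [4.0, 3.8, 3.9, 3.5],
    "wt": [2.0, 2.4, 2.6, 3.2],
})

# Merge the two specs into one dict, much like two across() calls in summarize().
spec = {**across(df.columns, r"p$", "sum"), **across(df.columns, r"t$", "mean")}
result = df.groupby("cyl").agg(spec)
```

If a column matches both patterns, the later spec wins because of how the dictionaries are merged, so the order of the `across` calls matters in this sketch.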

Note: passing a dictionary is no longer or more verbose than the first example you quoted, and I would suggest it over the pd.concat method:

dat.agg({"mpg": "sum",
         "disp": "sum",
         "drat": "mean",
         "wt": "mean",
         "qsec": "mean"})

That doesn't take the shine off across, though; it still looks cool.
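To check that the dictionary form and the pd.concat form really agree, here is a small sketch on made-up data (the values are invented; only the column names echo mtcars). It builds both results and compares them with `pd.testing.assert_frame_equal`:

```python
import pandas as pd

# Made-up data; only the column names echo mtcars.
df = pd.DataFrame({
    "cyl": [4, 4, 6, 6],
    "mpg": [30.0, 26.0, 21.0, 19.0],
    "disp": [100.0, 120.0, 160.0, 200.0],
    "drat": [4.0, 3.8, 3.9, 3.5],
    "wt": [2.0, 2.4, 2.6, 3.2],
    "qsec": [18.0, 19.0, 16.0, 17.0],
})
dat = df.groupby("cyl")

# Two separate aggregations glued together column-wise.
via_concat = pd.concat(
    [dat[["mpg", "disp"]].sum(), dat[["drat", "wt", "qsec"]].mean()], axis=1
)

# One aggregation driven by a column-to-function dictionary.
via_dict = dat.agg(
    {"mpg": "sum", "disp": "sum", "drat": "mean", "wt": "mean", "qsec": "mean"}
)

# Raises AssertionError if the results differ; check_like ignores row/column order.
pd.testing.assert_frame_equal(via_concat, via_dict, check_like=True)
```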

Update: for the regex/string part, taking a cue from @Richiev's post, a dictionary comprehension fits quite nicely here:

dat.agg({col :'mean'
         if col.endswith('t')
         else 'sum'
         for col in df.filter(regex=r".*(p|t)$").columns
         })
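To see which columns that comprehension picks up, here is a small sketch on an empty frame that carries only a subset of the mtcars column names (an assumption made for illustration). `DataFrame.filter(regex=...)` matches labels with `re.search`, so the leading `.*` in the pattern is optional:

```python
import pandas as pd

# Empty frame: only the column labels matter for this illustration.
df = pd.DataFrame(columns=["mpg", "cyl", "disp", "hp", "drat", "wt", "qsec"])

# Keep only the labels that end in "p" or "t".
selected = list(df.filter(regex=r"(p|t)$").columns)

# The dictionary the comprehension would hand to agg().
spec = {col: "mean" if col.endswith("t") else "sum" for col in selected}
```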

Alternatively, you could do it without calling filter (I had to play with the code again and look through Stack Overflow for ideas to pull this off):

dat.agg({col: "mean"
         if col.endswith("t") else "sum"
         for col in df
         if col.endswith(("t", "p"))})

Another idea (from a linked post):

mapping = {"t": "mean", "p": "sum"}
dat.agg({col: mapping.get(col[-1])
         for col in df
         if col.endswith(("t", "p"))})

There are probably more ways to pull this off with the tools available in Python.

