Question
In R, I find the following very useful when dealing with many variables:
library(dplyr)
dat <- group_by(mtcars, cyl)
summarize(dat, across(c('mpg','disp'), sum), across(c('drat','wt','qsec'), mean))
Or even better, selecting with pseudo-regex:
summarize(dat, across(ends_with('p'), sum), across(ends_with('t'), mean))
# A tibble: 3 x 5
    cyl  disp    hp  drat    wt
  <dbl> <dbl> <dbl> <dbl> <dbl>
1     4 1156.   909  4.07  2.29
2     6 1283.   856  3.59  3.12
3     8 4943.  2929  3.23  4.00
In pandas, the equivalent seems to be passing variables one by one into a dictionary, e.g. from this gist:
group_agg = df.groupby("group1").agg({
    "var1": ["mean"],
    "var2": ["sum"],
    "var3": ["mean"],
})
Is there a less verbose way to do this operation in pandas, or with some other package?
Recommended answer
For the first scenario, pandas concat suffices:
import pandas as pd
# df is assumed to hold the mtcars data
dat = df.groupby("cyl")
pd.concat([dat[["mpg", "disp"]].sum(), dat[["drat", "wt", "qsec"]].mean()], axis=1)
For the regex/string processing part, some verbosity is unavoidable:
cols_p = [col for col in df.columns if col.endswith("p")]
cols_t = [col for col in df.columns if col.endswith("t")]
pd.concat((dat[cols_p].sum(), dat[cols_t].mean()), axis=1)
It would be cool, though, if you could write a function that encapsulates across, particularly for the regex part; that's a lovely trick.
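One way such a wrapper might look is sketched below. The name `across` and its signature are hypothetical and not part of pandas. It assumes the selector is either a list of column names or a regular expression string matched with `re.search`, and the small DataFrame is made-up sample data with mtcars-like column names, not the real dataset.

```python
import re
import pandas as pd


def across(df, selector, func):
    """Hypothetical helper: map each selected column of df to func.

    selector is either a list of column names or a regex string that is
    matched against every column name with re.search.
    """
    if isinstance(selector, str):
        cols = [c for c in df.columns if re.search(selector, c)]
    else:
        cols = list(selector)
    return {c: func for c in cols}


# Made-up sample data with mtcars-like column names.
df = pd.DataFrame({
    "cyl": [4, 4, 6, 6],
    "disp": [100.0, 120.0, 160.0, 200.0],
    "hp": [90, 110, 110, 120],
    "drat": [4.0, 3.8, 3.9, 3.1],
    "wt": [2.0, 2.4, 2.6, 3.2],
})

# Merge the per-selector dictionaries and pass the result to agg.
spec = {**across(df, r"p$", "sum"), **across(df, r"t$", "mean")}
result = df.groupby("cyl").agg(spec)
print(result)
```

Because the helper only builds a dictionary, the aggregation itself still goes through the ordinary `agg` call, so it composes with anything else `agg` accepts.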
Note: passing a dictionary is no longer or more verbose than the first example you quoted. I would suggest it over the pandas concat method:
dat.agg({
    "mpg": "sum",
    "disp": "sum",
    "drat": "mean",
    "wt": "mean",
    "qsec": "mean",
})
That doesn't take away the shine from across, though; it looks cool.
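As a quick sanity check that the dictionary form and the concat form agree, here is a small self-contained comparison. The DataFrame is made-up sample data using mtcars column names, not the real dataset, and the variable names `via_concat` and `via_dict` are only illustrative.

```python
import pandas as pd

# Made-up sample data with mtcars-like column names.
df = pd.DataFrame({
    "cyl": [4, 4, 6, 6],
    "mpg": [30.0, 26.0, 21.0, 19.0],
    "disp": [100.0, 120.0, 160.0, 200.0],
    "drat": [4.0, 3.8, 3.9, 3.1],
    "wt": [2.0, 2.4, 2.6, 3.2],
    "qsec": [18.0, 19.0, 16.0, 17.0],
})
dat = df.groupby("cyl")

# Approach 1: aggregate each block of columns, then glue them side by side.
via_concat = pd.concat(
    [dat[["mpg", "disp"]].sum(), dat[["drat", "wt", "qsec"]].mean()],
    axis=1,
)

# Approach 2: one agg call with a column -> function dictionary.
via_dict = dat.agg({
    "mpg": "sum",
    "disp": "sum",
    "drat": "mean",
    "wt": "mean",
    "qsec": "mean",
})

# Raises an AssertionError if the two results differ.
pd.testing.assert_frame_equal(via_concat, via_dict)
```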
Update: for the regex/string part, taking a cue from @Richiev's post, a dictionary comprehension fits in quite nicely here:
dat.agg({
    col: "mean" if col.endswith("t") else "sum"
    for col in df.filter(regex=r".*(p|t)$").columns
})
Alternatively, you could do it without calling filter (I had to play with the code again and look through Stack Overflow for ideas to pull this off):
dat.agg({
    col: "mean" if col.endswith("t") else "sum"
    for col in df
    if col.endswith(("t", "p"))
})
Another idea, from here:
mapping = {"t": "mean", "p": "sum"}
dat.agg({
    col: mapping.get(col[-1])
    for col in df
    if col.endswith(("t", "p"))
})
There are probably more ways to pull it off, using the available tools within Python.