Problem description
Suppose we want to group_by() and summarise a massive data.frame with very many columns, but there are some large groups of consecutive columns that will have the same summarise condition (e.g. max, mean, etc.).
Is there a way to avoid having to specify the summarise condition for each and every column, and instead do it for ranges of columns?
Say we want to do this:
iris %>%
  group_by(Species) %>%
  summarise(max(Sepal.Length), mean(Sepal.Width), mean(Petal.Length), mean(Petal.Width))
But note that 3 consecutive columns have the same summarise condition: mean(Sepal.Width), mean(Petal.Length), mean(Petal.Width).
Is there a way to use something like mean(Sepal.Width:Petal.Width) to specify the condition for the whole range of columns, and hence avoid having to type out the summarise condition for every column in between?
The iris example above is small and manageable, with a range of only 3 consecutive columns, but the actual use case has hundreds.
Recommended answer
The upcoming version 1.0.0 of dplyr will have an across() function that does what you wish for.
across() has two primary arguments:
- The first argument, .cols, selects the columns you want to operate on. It uses tidy selection (like select()) so you can pick variables by position, name, and type.
- The second argument, .fns, is a function or list of functions to apply to each column. This can also be a purrr style formula (or list of formulas) like ~ .x / 2. (This argument is optional, and you can omit it if you just want to get the underlying data; you'll see that technique used in vignette("rowwise").)
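To make the .cols argument more concrete, here is a minimal sketch of three tidy-selection styles applied to iris: a range of consecutive columns, a name helper, and selection by type. It assumes dplyr 1.0.0 or later is installed; the result names (by_range, by_prefix, by_type) are only illustrative.

```r
library(dplyr, warn.conflicts = FALSE)

# Range of consecutive columns, given by the first and last name
by_range <- iris %>%
  group_by(Species) %>%
  summarise(across(Sepal.Width:Petal.Width, mean))

# Columns picked by a name helper
by_prefix <- iris %>%
  group_by(Species) %>%
  summarise(across(starts_with("Petal"), mean))

# Columns picked by type: every numeric column
by_type <- iris %>%
  group_by(Species) %>%
  summarise(across(where(is.numeric), mean))

names(by_range)
names(by_prefix)
names(by_type)
```

When .names is not supplied and a single function is used, the output columns keep the names of the input columns.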
### Install the development version from GitHub first
# install.packages("devtools")
# devtools::install_github("tidyverse/dplyr")
library(dplyr, warn.conflicts = FALSE)
Control how the names are created with the .names argument, which takes a glue spec:
iris %>%
  group_by(Species) %>%
  summarise(
    across(c(Sepal.Width:Petal.Width), ~ mean(.x, na.rm = TRUE), .names = "mean_{col}"),
    across(c(Sepal.Length), ~ max(.x, na.rm = TRUE), .names = "max_{col}")
  )
#> # A tibble: 3 x 5
#> Species mean_Sepal.Width mean_Petal.Leng~ mean_Petal.Width max_Sepal.Length
#> * <fct> <dbl> <dbl> <dbl> <dbl>
#> 1 setosa 3.43 1.46 0.246 5.8
#> 2 versicolor 2.77 4.26 1.33 7
#> 3 virginica 2.97 5.55 2.03 7.9
Using multiple functions
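# Named list of purrr-style formulas; the names become {fn} in .names
my_func <- list(
  mean = ~ mean(., na.rm = TRUE),
  max = ~ max(., na.rm = TRUE)
)
iris %>%
  group_by(Species) %>%
  # Predicate functions should be wrapped in where() in released dplyr
  summarise(across(where(is.numeric), my_func, .names = "{fn}.{col}"))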
#> # A tibble: 3 x 9
#> Species mean.Sepal.Length max.Sepal.Length mean.Sepal.Width max.Sepal.Width
#> * <fct> <dbl> <dbl> <dbl> <dbl>
#> 1 setosa 5.01 5.8 3.43 4.4
#> 2 versicolor 5.94 7 2.77 3.4
#> 3 virginica 6.59 7.9 2.97 3.8
#> mean.Petal.Length max.Petal.Length mean.Petal.Width max.Petal.Width
#> * <dbl> <dbl> <dbl> <dbl>
#> 1 1.46 1.9 0.246 0.6
#> 2 4.26 5.1 1.33 1.8
#> 3 5.55 6.9 2.03 2.5
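The same pattern should scale to the wide case described in the question. The following is a rough sketch on made-up data: the data frame wide, its columns v1 to v200, and the grouping column grp are all hypothetical stand-ins for the real table, and the split at column 150 is arbitrary.

```r
library(dplyr, warn.conflicts = FALSE)

# Hypothetical wide data: 200 numeric columns v1..v200 plus a grouping column
set.seed(1)
wide <- as.data.frame(matrix(rnorm(10 * 200), nrow = 10))
names(wide) <- paste0("v", 1:200)
wide$grp <- rep(c("a", "b"), each = 5)

result <- wide %>%
  group_by(grp) %>%
  summarise(
    across(v1:v150, ~ mean(.x, na.rm = TRUE)),  # mean for the first 150 columns
    across(v151:v200, ~ max(.x, na.rm = TRUE))  # max for the remaining 50 columns
  )

dim(result)
```

The result has one row per group and one column per summarised variable, plus the grouping column, so each block of consecutive columns needs only one across() call regardless of how many columns it spans.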