Applying a summarise condition to a range of columns when using dplyr group_by()?

Question

Suppose we want to group_by() and summarise a massive data.frame with very many columns, where some large groups of consecutive columns share the same summarise condition (e.g. max, mean, etc.).

Is there a way to avoid having to specify the summarise condition for each and every column, and instead do it for ranges of columns?

Suppose we want to do this:

iris %>% 
  group_by(Species) %>% 
  summarise(max(Sepal.Length), mean(Sepal.Width), mean(Petal.Length), mean(Petal.Width))

Note, however, that 3 consecutive columns share the same summarise condition: mean(Sepal.Width), mean(Petal.Length), mean(Petal.Width).

Is there a way to use something like mean(Sepal.Width:Petal.Width) to specify the condition for the whole range of columns, and hence avoid having to type out the summarise condition separately for every column in between?

The iris example above is small and manageable, with a range of only 3 consecutive columns, but the actual use case has hundreds.

Recommended answer

Version 1.0.0 of dplyr introduces the across() function, which does what you are asking for.

across() has two primary arguments:


  • The first argument, .cols, selects the columns you want to operate on. It uses tidy selection (like select()) so you can pick variables by position, name, and type.

  • The second argument, .fns, is a function or list of functions to apply to each column. This can also be a purrr style formula (or list of formulas) like ~ .x / 2. (This argument is optional, and you can omit it if you just want to get the underlying data; you'll see that technique used in vignette("rowwise").)
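
As a rough illustration of the two arguments described above, here is a minimal sketch on the iris data. It assumes dplyr >= 1.0.0 is installed; the object names (by_species, by_name, by_type, by_formula) are made up for this example.

```r
library(dplyr, warn.conflicts = FALSE)

by_species <- group_by(iris, Species)

# .cols by name: a range of adjacent columns, written as in select()
by_name <- summarise(by_species, across(Sepal.Width:Petal.Width, mean))

# .cols by type: every numeric column
by_type <- summarise(by_species, across(where(is.numeric), mean))

# .fns as a purrr-style formula, where .x stands for the current column
by_formula <- summarise(
  by_species,
  across(starts_with("Petal"), ~ mean(.x, na.rm = TRUE))
)

names(by_name)
names(by_type)
names(by_formula)
```

Without a .names argument, each output column keeps the name of the column it was computed from, and the grouping column Species is left out of the selection.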





### across() needs dplyr >= 1.0.0; when this answer was written, that meant installing the development version from GitHub first
# install.packages("devtools")
# devtools::install_github("tidyverse/dplyr")
library(dplyr, warn.conflicts = FALSE)

Control how the output names are created with the .names argument, which takes a glue spec:

iris %>% 
  group_by(Species) %>% 
  summarise(
    across(c(Sepal.Width:Petal.Width), ~ mean(.x, na.rm = TRUE), .names = "mean_{.col}"),
    across(c(Sepal.Length), ~ max(.x, na.rm = TRUE), .names = "max_{.col}")
    )
#> # A tibble: 3 x 5
#>   Species    mean_Sepal.Width mean_Petal.Leng~ mean_Petal.Width max_Sepal.Length
#> * <fct>                 <dbl>            <dbl>            <dbl>            <dbl>
#> 1 setosa                 3.43             1.46            0.246              5.8
#> 2 versicolor             2.77             4.26            1.33               7  
#> 3 virginica              2.97             5.55            2.03               7.9
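
To check that this matches what the question set out to compute, the across() version can be compared against a hand-written summarise(). This is only a sketch, assuming dplyr >= 1.0.0; the names manual and with_across are invented for the comparison.

```r
library(dplyr, warn.conflicts = FALSE)

# Every column spelled out by hand, as in the question
manual <- iris %>%
  group_by(Species) %>%
  summarise(
    mean_Sepal.Width  = mean(Sepal.Width),
    mean_Petal.Length = mean(Petal.Length),
    mean_Petal.Width  = mean(Petal.Width),
    max_Sepal.Length  = max(Sepal.Length)
  )

# The same summary with one across() call per summarise condition
with_across <- iris %>%
  group_by(Species) %>%
  summarise(
    across(Sepal.Width:Petal.Width, mean, .names = "mean_{.col}"),
    across(Sepal.Length, max, .names = "max_{.col}")
  )

# Compare the two results column by column
all.equal(as.data.frame(manual), as.data.frame(with_across))
```

The comparison should report that the two results agree, since both compute the same statistics on the same columns under the same names.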

Using multiple functions:

my_func <- list(
  mean = ~ mean(., na.rm = TRUE),
  max  = ~ max(., na.rm = TRUE)
)

iris %>%
  group_by(Species) %>%
  summarise(across(where(is.numeric), my_func, .names = "{.fn}.{.col}"))
#> # A tibble: 3 x 9
#>   Species    mean.Sepal.Length max.Sepal.Length mean.Sepal.Width max.Sepal.Width
#> * <fct>                  <dbl>            <dbl>            <dbl>           <dbl>
#> 1 setosa                  5.01              5.8             3.43             4.4
#> 2 versicolor              5.94              7               2.77             3.4
#> 3 virginica               6.59              7.9             2.97             3.8
#>   mean.Petal.Length max.Petal.Length mean.Petal.Width max.Petal.Width
#> *             <dbl>            <dbl>            <dbl>           <dbl>
#> 1              1.46              1.9            0.246             0.6
#> 2              4.26              5.1            1.33              1.8
#> 3              5.55              6.9            2.03              2.5
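
For the "hundreds of columns" case mentioned in the question, the same pattern should carry over unchanged. The sketch below uses a synthetic data frame; wide, grp and the column names V1 to V200 are hypothetical stand-ins, not the asker's data. It assumes dplyr >= 1.0.0.

```r
library(dplyr, warn.conflicts = FALSE)

# Synthetic wide data: 20 rows, 200 numeric columns named V1..V200
set.seed(1)
wide <- as.data.frame(matrix(rnorm(20 * 200), nrow = 20))
wide$grp <- rep(c("a", "b"), each = 10)

res <- wide %>%
  group_by(grp) %>%
  summarise(
    # one condition for the first block of consecutive columns
    across(V1:V150, mean, .names = "mean_{.col}"),
    # another condition for the remaining block
    across(V151:V200, max, .names = "max_{.col}")
  )

dim(res)
```

Each block of consecutive columns needs only one across() call, so the amount of code depends on the number of distinct summarise conditions, not on the number of columns.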
