How to use summarise_at to apply different functions to different columns?

Problem description

I have a dataframe with the following columns:

> colnames(my.dataframe)
 [1] "id"                              "firstName"                       "lastName"
 [4] "position"                        "jerseyNumber"                    "currentTeamId"
 [7] "currentTeamAbbreviation"         "currentRosterStatus"             "height"
[10] "weight"                          "birthDate"                       "age"
[13] "birthCity"                       "birthCountry"                    "rookie"
[16] "handednessShoots"                "college"                         "twitter"
[19] "currentInjuryDescription"        "currentInjuryPlayingProbability" "teamId"
[22] "teamAbbreviation"                "fg2PtAtt"                        "fg3PtAtt"
[25] "fg2PtMade"                       "fg3PtMade"                       "ftMade"
[28] "fg2PtPct"                        "fg3PtPct"                        "ftPct"
[31] "ast"                             "tov"                             "offReb"
[34] "foulsDrawn"                      "blkAgainst"                      "plusMinus"
[37] "minSeconds"

And here is my code that isn't working:

my.dataframe %>%
  dplyr::group_by(id) %>%
  dplyr::summarise_at(vars(firstName:currentInjuryPlayingProbability), funs(min), na.rm = TRUE) %>%
  dplyr::summarise_at(vars(fg2PtAtt:minSeconds), funs(sum), na.rm = TRUE) %>%
                    vars(), funs(min), na.rm = TRUE) %>%
  dplyr::summarise(teamId = paste(teamId), teamAbbreviation = paste(teamAbbreviation))

First I group by id (which is not a unique column in my dataframe, despite its name). The next 19 columns, up to currentInjuryPlayingProbability, always hold the same value within an id, so I use the min function to summarise / grab that value.

Next, I want to summarise all columns from fg2PtAtt to the end with the mean value (these columns are all numeric / integer).

Lastly, for the columns teamId and teamAbbreviation (which are not the same when grouped_by id), I want to paste them into a single string each with summarise.

My approach doesn't work because I don't think I can call summarise_at, followed by another summarise_at, followed by a summarise. By the time the second summarise_at is called, the columns I am trying to summarise have already been removed by the first summarise_at.

Any help with this is appreciated! I will update shortly with a subset of my dataframe that code can be tested on.

EDIT:

dput(my.dataframe)
structure(list(id = c(10138L, 9466L, 9360L, 9360L), firstName = c("Alex",
"Quincy", "Luke", "Luke"), lastName = c("Abrines", "Acy", "Babbitt",
"Babbitt"), currentInjuryPlayingProbability = c(NA_character_,
NA_character_, NA_character_, NA_character_), teamId = c(96L,
84L, 91L, 92L), teamAbbreviation = c("OKL", "BRO", "ATL", "MIA"
), fg2PtAtt = c(70L, 73L, 57L, 2L), fg3PtAtt = c(221L, 292L,
111L, 45L), minSeconds = c(67637L, 81555L, 34210L, 8676L)), row.names = c(NA,
-4L), class = c("tbl_df", "tbl", "data.frame"))

my.dataframe
     id firstName lastName currentInjuryPlayingProbability teamId teamAbbreviation fg2PtAtt fg3PtAtt minSeconds
  <int> <chr>     <chr>    <chr>                            <int> <chr>               <int>    <int>      <int>
1 10138 Alex      Abrines  <NA>                                96 OKL                    70      221      67637
2  9466 Quincy    Acy      <NA>                                84 BRO                    73      292      81555
3  9360 Luke      Babbitt  <NA>                                91 ATL                    57      111      34210
4  9360 Luke      Babbitt  <NA>                                92 MIA                     2       45       8676

Here is a shortened example with only 9 columns, but with enough data to highlight the problem. The resulting dataframe should look like this:

     id firstName lastName currentInjuryPlayingProbability teamId teamAbbreviation fg2PtAtt fg3PtAtt minSeconds
  <int> <chr>     <chr>    <chr>                           <chr>  <chr>               <int>    <int>      <int>
1 10138 Alex      Abrines  <NA>                            96     OKL                    70      221      67637
2  9466 Quincy    Acy      <NA>                            84     BRO                    73      292      81555
3  9360 Luke      Babbitt  <NA>                            91, 92 ATL, MIA               57      156      42886

Solution

This is what I think is the simplest way for this particular task, at least compared to some similar map2/reduce solutions I've seen.

First point is that if you are using min to grab a value because you think it should be the same for every value of your grouping variable, just add it to the grouping. Then it is automatically preserved.
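
As a minimal sketch of this first point, assuming a toy data frame modelled on the question (the `players` and `by_player` names and the `points` column are hypothetical), grouping by the constant column keeps it in the output without any call to min:

```r
library(dplyr)

# Hypothetical toy data: firstName is constant within each id
players <- tibble(
  id        = c(1L, 2L, 2L),
  firstName = c("Alex", "Luke", "Luke"),
  points    = c(10, 4, 6)
)

# Grouping by firstName as well as id preserves it in the summary,
# so no min() is needed to "grab" its value
by_player <- players %>%
  group_by(id, firstName) %>%
  summarise(points = mean(points)) %>%
  ungroup()
```

This only gives the same result as grouping by id alone if the extra columns really are constant within each id; otherwise the groups get split further.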

Second is that you can use {} to override the automatic placement of the LHS of %>% into the first argument of the RHS. This lets you in a single step apply different transformations and recombine them. Usually you don't need this because the placeholder . will do it for you, but if the placeholder is not a naked argument to the RHS you sometimes need it. (I am sure I read some resource that describes the exact rules but I can't find it right now).
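
A small illustration of the brace behaviour, using a plain numeric vector rather than the question's data (the variable names are made up for the example). When the placeholder only appears inside nested calls, magrittr still inserts the left-hand side as the first argument; wrapping the right-hand side in braces switches that off:

```r
library(magrittr)

x <- c(1, 2, 3, 4)

# The dot only appears inside nested calls, so x is ALSO passed
# as the first argument: this is c(x, min(x), max(x))
no_braces <- x %>% c(min(.), max(.))

# With braces, x is used only where the dot appears: c(min(x), max(x))
with_braces <- x %>% {c(min(.), max(.))}
```

This is why the left_join call in the solution is wrapped in braces: both of its arguments are nested calls that use the placeholder.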

Third is that because you know that summarise will drop columns you didn't select except the grouping variables, left_join will automatically use the shared column names to join on.
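
To make the join behaviour concrete, here is a sketch with two hypothetical summary tables that share only the id column. When no `by` argument is given, left_join joins on every column name the two tables have in common and prints a message saying which ones it used:

```r
library(dplyr)

# Two hypothetical per-group summaries that share only the "id" column
teams <- tibble(id = c(1L, 2L), teamAbbreviation = c("OKL", "ATL,MIA"))
shots <- tibble(id = c(1L, 2L), fg2PtAtt = c(70, 29.5))

# No `by` given: the join uses the shared column name(s), here "id"
joined <- left_join(teams, shots)
```

Passing `by = "id"` explicitly gives the same result here and silences the message.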

This means that we can do the following, which I think is pretty clean. If the transformations start to get particularly complex though (like if there are pipes inside the left_join), I would recommend giving each piece of the final output its own assignment and name, to be clearer. You also need to be careful if you want more than one summary of the same column (like both mean and standard deviation), because as written the names will collide.
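
A sketch of both suggestions, on hypothetical toy data: each piece gets its own name, and a named list of functions is passed to summarise_at so the output columns are suffixed (for example fg2PtAtt_mean and fg2PtAtt_sd) instead of colliding. The `stats`, `teams` and `shots` names are illustrative, and base `paste` stands in for `str_c` to keep the example to one package:

```r
library(dplyr)

# Hypothetical toy data modelled on the question
stats <- tibble(
  id               = c(1L, 2L, 2L),
  teamAbbreviation = c("OKL", "ATL", "MIA"),
  fg2PtAtt         = c(70L, 57L, 2L),
  fg3PtAtt         = c(221L, 111L, 45L)
)
grouped <- group_by(stats, id)

# Piece 1: collapse the team labels into one string per id
teams <- summarise_at(grouped, vars(teamAbbreviation),
                      ~ paste(., collapse = ","))

# Piece 2: a named list of functions yields suffixed column names,
# so mean and sd of the same column no longer collide
shots <- summarise_at(grouped, vars(fg2PtAtt, fg3PtAtt),
                      list(mean = mean, sd = sd))

result <- left_join(teams, shots, by = "id")
```

Groups with a single row get NA for the standard deviation, which is the normal behaviour of sd.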

library(tidyverse)

my_dataframe <- structure(list(id = c(10138L, 9466L, 9360L, 9360L), firstName = c("Alex", "Quincy", "Luke", "Luke"), lastName = c("Abrines", "Acy", "Babbitt", "Babbitt"), currentInjuryPlayingProbability = c(NA_character_, NA_character_, NA_character_, NA_character_), teamId = c(96L, 84L, 91L, 92L), teamAbbreviation = c("OKL", "BRO", "ATL", "MIA"), fg2PtAtt = c(70L, 73L, 57L, 2L), fg3PtAtt = c(221L, 292L, 111L, 45L), minSeconds = c(67637L, 81555L, 34210L, 8676L)), row.names = c(NA, -4L), class = c("tbl_df", "tbl", "data.frame"))

my_dataframe %>%
  group_by_at(.vars = vars(id:lastName)) %>%
  {left_join(
    summarise_at(., vars(teamId:teamAbbreviation), ~ str_c(., collapse = ",")),
    summarise_at(., vars(fg2PtAtt:minSeconds), mean)
  )}
#> Joining, by = c("id", "firstName", "lastName")
#> # A tibble: 3 x 8
#> # Groups:   id, firstName [?]
#>      id firstName lastName teamId teamAbbreviation fg2PtAtt fg3PtAtt
#>   <int> <chr>     <chr>    <chr>  <chr>               <dbl>    <dbl>
#> 1  9360 Luke      Babbitt  91,92  ATL,MIA              29.5       78
#> 2  9466 Quincy    Acy      84     BRO                  73        292
#> 3 10138 Alex      Abrines  96     OKL                  70        221
#> # ... with 1 more variable: minSeconds <dbl>

Created on 2018-07-31 by the reprex package (v0.2.0).
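
For readers on a newer dplyr, the following is an alternative sketch rather than part of the answer above. It assumes dplyr 1.0.0 or later, where `across()` lets a single summarise call apply different functions to different column ranges, so neither the braces nor the join is needed. The data is the question's example without the all-NA column:

```r
library(dplyr)

my_dataframe <- tibble(
  id               = c(10138L, 9466L, 9360L, 9360L),
  firstName        = c("Alex", "Quincy", "Luke", "Luke"),
  lastName         = c("Abrines", "Acy", "Babbitt", "Babbitt"),
  teamId           = c(96L, 84L, 91L, 92L),
  teamAbbreviation = c("OKL", "BRO", "ATL", "MIA"),
  fg2PtAtt         = c(70L, 73L, 57L, 2L),
  fg3PtAtt         = c(221L, 292L, 111L, 45L),
  minSeconds       = c(67637L, 81555L, 34210L, 8676L)
)

# Assumes dplyr >= 1.0.0 for across() and the .groups argument
result <- my_dataframe %>%
  group_by(id, firstName, lastName) %>%
  summarise(
    # collapse the team columns into one comma-separated string each
    across(teamId:teamAbbreviation, ~ paste(.x, collapse = ",")),
    # average the numeric stat columns
    across(fg2PtAtt:minSeconds, mean),
    .groups = "drop"
  )
```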
