本文介绍了折叠行,其中一些都是 NA,另一些与一些 NA 不相交的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个简单的数据框:

I have a simple dataframe as such:

ID    Col1    Col2    Col3    Col4
1     NA      NA      NA      NA
1     5       10      NA      NA
1     NA      NA      15      20
2     NA      NA      NA      NA
2     25      30      NA      NA
2     NA      NA      35      40

我想重新格式化它:

ID    Col1    Col2    Col3    Col4
1     5       10      15      20
2     25      30      35      40

(请注意:真实数据集有数千行,值来自生物数据——NA 不遵循简单的模式,除了 NAs 是不相交的,是的,每个 ID 正好有 3 行).

(please note: the real data set has thousands of rows and the values are from biological data -- the NAs follow no simple pattern, except that the NAs are disjoint, and yes there are exactly 3 rows for each ID).

第一步:删除只有 NA 值的行.

STEP ONE: get rid of rows that have only NA values.

表面上看起来很简单,但我遇到了一些问题.

On the surface this looked simple, but I've run across some problems.

complete.cases(DF) 返回所有 FALSE,所以我不能真正使用它来删除所有 NA 的行,如 DF[complete.cases(DF),].这是因为所有行都至少包含一个 NA.

complete.cases(DF) returns all FALSE, so I can't really use this to remove the rows with all NAs, as in DF[complete.cases(DF),]. This is because all rows contain at least one NA.

由于 NA 想要传播自己,其他使用 is.na 的方案因同样的原因而失败.

Since NAs want to propagate themselves, other schemes using is.na fail for the same reason.

第二步:将剩余的两行合并为一.

STEP TWO: collapse the remaining two rows into one.

考虑使用像 aggregate 这样的东西来实现这一点,但必须有比 this,这根本不起作用.

Thinking about using something like aggregate to pull this off, but there has got to be an easier way than this, which doesn't work at all.

感谢您的建议.

推荐答案

试试

library(dplyr)
DF %>% group_by(ID) %>% summarise_each(funs(sum(., na.rm = TRUE)))

为了说明某一列对于某个 ID 的所有 NAs 的情况,我们需要 sum_NA() 函数返回 NA 如果所有都是 NAs


To account for the case in which one column has all NAs for a certain ID, we need sum_NA() function which returns NA if all are NAs

txt <- "ID    Col1    Col2    Col3    Col4
        1     NA      NA      NA      NA
        1     5       10      NA      NA
        1     NA      NA      15      20
        2     NA      NA      NA      NA
        2     NA      30      NA      NA
        2     NA      NA      35      40"
DF <- read.table(text = txt, header = TRUE)

# original code
DF %>%
  group_by(ID) %>%
  summarise_each(funs(sum(., na.rm = TRUE)))

# `summarise_each()` is deprecated.
# Use `summarise_all()`, `summarise_at()` or `summarise_if()` instead.
# To map `funs` over all variables, use `summarise_all()`
# A tibble: 2 x 5
     ID  Col1  Col2  Col3  Col4
  <int> <int> <int> <int> <int>
1     1     5    10    15    20
2     2     0    30    35    40

sum_NA <- function(x) {if (all(is.na(x))) x[NA_integer_] else sum(x, na.rm = TRUE)}

DF %>%
  group_by(ID) %>%
  summarise_all(funs(sum_NA))

DF %>%
  group_by(ID) %>%
  summarise_if(is.numeric, funs(sum_NA))

# A tibble: 2 x 5
     ID  Col1  Col2  Col3  Col4
  <int> <int> <int> <int> <int>
1     1     5    10    15    20
2     2    NA    30    35    40

这篇关于折叠行,其中一些都是 NA,另一些与一些 NA 不相交的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!

05-19 04:17