R: the fastest way to create a data frame using an alternative to IFELSE

Problem description

I have a question similar to the one in this thread: Using R, replace all values in a matrix <0.1 with 0?

But in my case I have a hypothetically larger dataset and variable thresholds. I need to create a data frame in which each value is chosen by a condition on the values in the first columns of the same data frame. These threshold values are different for each row.

Here is an example of the data frame:

SNP        A1  A2   MAF     
rs3094315  G   A   0.172  
rs7419119  G   T   0.240  
rs13302957 G   A   0.081  
rs6696609  T   C   0.393 

Here is an example of my code:

seqIndividuals = seq_len(201)
for(i in seqIndividuals) {
  alFrequ[paste("IND",i,"a",sep="")] = ifelse(runif(length(alFrequ$SNP),0.00,1.00) < alFrequ$MAF, alFrequ$A1, alFrequ$A2)
  alFrequ[paste("IND",i,"b",sep="")] = ifelse(runif(length(alFrequ$SNP),0.00,1.00) < alFrequ$MAF, alFrequ$A1, alFrequ$A2)
}

I am creating two new columns for each individual "i" in "seqIndividuals" by taking the value from column "A1" if a random value is lower than column "MAF", or from "A2" otherwise. The code works well, but as the dataset grows in rows and columns (individuals), the running time also grows significantly.

Is there a way to avoid using IFELSE in this situation, since I understand it works as a loop? I tried generating a matrix of random values and then replacing them, but it takes the same time or even longer.

mtxAlFrequ = matrix(runif(length(alFrequ$SNP)*(201)),nrow=length(alFrequ$SNP),ncol=201)
mtxAlFrequ[mtxAlFrequ < alFrequ$MAF] = alFrequ$A1

Thanks!

Recommended answer

One option is data.table:

library(data.table)
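# Column names IND1a, IND1b, IND2a, ... matching the naming in the question
nm1 <- paste0("IND", rep(seqIndividuals, each = 2),
                    rep(letters[1:2], length(seqIndividuals)))
setDT(alFrequ)  # convert to a data.table by reference (no copy)
for(j in seq_along(nm1)) {
      # fill the new column with A2, then overwrite with A1 where a uniform draw is below MAF
      alFrequ[, nm1[j] := A2
             ][runif(.N, 0, 1) < MAF , nm1[j] := A1]
}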
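
For comparison, here is a minimal base R sketch of the matrix idea tried in the question. It is only an illustration under stated assumptions, not part of the answer above: the object names `pickA1` and `alleles` are made up for this example, and the four-row data frame is the sample from the question. One likely problem with the attempt in the question is that `mtxAlFrequ[mtxAlFrequ < alFrequ$MAF] = alFrequ$A1` recycles `A1` over the selected cells in order instead of matching each cell to its own row, so the sketch indexes `A1` by row explicitly.

```r
set.seed(1)

# Sample data copied from the question
alFrequ <- data.frame(
  SNP = c("rs3094315", "rs7419119", "rs13302957", "rs6696609"),
  A1  = c("G", "G", "G", "T"),
  A2  = c("A", "T", "A", "C"),
  MAF = c(0.172, 0.240, 0.081, 0.393),
  stringsAsFactors = FALSE
)

nInd <- 201              # number of individuals, as in the question
nSNP <- nrow(alFrequ)
nCol <- 2 * nInd         # two columns ("a" and "b") per individual

# One uniform draw per cell. The matrix is stored column by column and has
# nSNP rows, so MAF is recycled down each column: cell [i, j] is compared
# with MAF[i].
pickA1 <- matrix(runif(nSNP * nCol), nrow = nSNP) < alFrequ$MAF

# Start from A2 everywhere, then overwrite the selected cells with the A1
# value belonging to the same row.
alleles <- matrix(alFrequ$A2, nrow = nSNP, ncol = nCol)
alleles[pickA1] <- alFrequ$A1[row(alleles)[pickA1]]

colnames(alleles) <- paste0("IND", rep(seq_len(nInd), each = 2), c("a", "b"))
result <- cbind(alFrequ, as.data.frame(alleles, stringsAsFactors = FALSE))
```

Whether this is faster than the data.table loop will depend on the data size and available memory, since the whole random matrix is held in memory at once; it is worth timing on your own data.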



Benchmarks

set.seed(24)
seqIndividuals <- seq_len(201)  # the 201 individuals from the question
# Simulated data set with 340000 SNPs
alFrequ <- data.frame(SNP= paste0('rs', sample(600000, 340000, replace=FALSE)),
                   A1 = sample(c("G", "T", "A", "C"), 340000, replace=TRUE),
                   A2 = sample(c("G", "T", "A", "C"), 340000, replace=TRUE),
                   MAF = runif(340000, 0, 1), stringsAsFactors=FALSE)
nm1 <- paste0("IND", rep(seqIndividuals, each = 2),
                          rep(letters[1:2], length(seqIndividuals)))

system.time({
    setDT(alFrequ)
    for(j in seq_along(nm1)){
        alFrequ[, nm1[j] := A2][runif(.N, 0, 1) < MAF , nm1[j] := A1]
    }
})
#   user  system elapsed 
#  10.72    1.05   11.76 

And using the OP's code on the original dataset:

system.time({
 for(i in seqIndividuals) {
   alFrequ[paste("IND",i,"a",sep="")] = ifelse(runif(length(alFrequ$SNP),0.00,1.00) < 
          alFrequ$MAF, alFrequ$A1, alFrequ$A2)
   alFrequ[paste("IND",i,"b",sep="")] = ifelse(runif(length(alFrequ$SNP),0.00,1.00) < 
             alFrequ$MAF, alFrequ$A1, alFrequ$A2)
 }
})
#    user  system elapsed 
#   72.16    6.82   79.33 
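
To repeat this kind of comparison without data.table installed, the following base R sketch times the `ifelse` loop against a plain "fill with A2, then overwrite with A1" loop, which mirrors the idea of the data.table code. Everything here is an assumption made for illustration: the helper names `sim_ifelse` and `sim_overwrite` are hypothetical, and the data set is deliberately much smaller than the 340000 rows and 201 individuals used above, so the timings it prints are not comparable with the figures reported in this answer.

```r
set.seed(24)

n    <- 20000   # rows; reduced from 340000 (assumption, to keep the run short)
nInd <- 20      # individuals; reduced from 201 (assumption)

snps <- data.frame(
  A1  = sample(c("G", "T", "A", "C"), n, replace = TRUE),
  A2  = sample(c("G", "T", "A", "C"), n, replace = TRUE),
  MAF = runif(n),
  stringsAsFactors = FALSE
)

# Hypothetical helper: the approach from the question, one ifelse() per column
sim_ifelse <- function(d, nInd) {
  for (i in seq_len(nInd)) for (s in c("a", "b")) {
    d[[paste0("IND", i, s)]] <- ifelse(runif(nrow(d)) < d$MAF, d$A1, d$A2)
  }
  d
}

# Hypothetical helper: start from A2 and overwrite with A1 where the draw is below MAF
sim_overwrite <- function(d, nInd) {
  for (i in seq_len(nInd)) for (s in c("a", "b")) {
    col <- d$A2
    hit <- runif(nrow(d)) < d$MAF
    col[hit] <- d$A1[hit]
    d[[paste0("IND", i, s)]] <- col
  }
  d
}

t_ifelse    <- system.time(res_ifelse    <- sim_ifelse(snps, nInd))
t_overwrite <- system.time(res_overwrite <- sim_overwrite(snps, nInd))

# Elapsed seconds for each approach on this machine
print(c(ifelse = t_ifelse[["elapsed"]], overwrite = t_overwrite[["elapsed"]]))
```

The two runs use different random draws, so the resulting columns are not identical; only the running times and the structure of the output are meant to be compared.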
