Parallel execution of random forest in R


Problem description

I am running random forest in R in parallel:

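library(randomForest)
library(doMC)   # also attaches foreach
registerDoMC()  # register the multicore backend with its default worker count
x <- matrix(runif(500), 100)  # 100 observations, 5 features
y <- gl(2, 50)                # two-level factor, 50 observations per level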

Parallel execution (took 73 sec):

rf <- foreach(ntree=rep(25000, 6), .combine=combine, .packages='randomForest') %dopar%
    randomForest(x, y, ntree=ntree)

Sequential execution (took 82 sec):

rf <- foreach(ntree=rep(25000, 6), .combine=combine) %do%
    randomForest(x, y, ntree=ntree)

In the parallel run, growing the trees is quick (about 3 to 7 seconds), but the rest of the time is spent combining the results (the .combine option). So parallel execution only pays off when the number of trees is really high. Is there any way to tweak the combine step so that it skips any per-node calculation I don't need and runs faster?

PS. The above is just example data. In reality I have around 100 thousand features for around 100 observations.

Recommended answer

Setting .multicombine to TRUE can make a significant difference:

rf <- foreach(ntree=rep(25000, 6), .combine=randomForest::combine,
              .multicombine=TRUE, .packages='randomForest') %dopar% {
    randomForest(x, y, ntree=ntree)
}

This causes combine to be called once rather than five times. On my desktop machine, this runs in 8 seconds rather than 19 seconds.
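To see the difference in call counts without training any forests, here is a minimal sketch that swaps randomForest::combine for a counting stand-in. The helper make_counter and the names pairwise and multi are hypothetical, introduced only for this illustration; the loop uses %do% so no parallel backend is needed.

```r
library(foreach)

# Hypothetical helper (not part of foreach): returns a combine function
# that concatenates its arguments and counts how often it is called.
make_counter <- function() {
  calls <- 0
  list(
    combine = function(...) {
      calls <<- calls + 1
      c(...)
    },
    count = function() calls
  )
}

# Default behaviour (.multicombine = FALSE): results are merged two at a time.
pairwise <- make_counter()
res1 <- foreach(i = 1:6, .combine = pairwise$combine) %do% i

# .multicombine = TRUE: all six results are passed to combine together.
multi <- make_counter()
res2 <- foreach(i = 1:6, .combine = multi$combine, .multicombine = TRUE) %do% i

cat("pairwise combine calls:", pairwise$count(), "\n")
cat("multicombine calls:    ", multi$count(), "\n")
```

With six tasks, the pairwise version should report five calls and the multicombine version one, matching the counts given in the answer. For much larger task counts, foreach's .maxcombine argument (default 100) caps how many results are handed to combine in a single call.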
