Suggestions for speeding up Random Forests


Question

I'm doing some work with the randomForest package and, while it works well, it can be time-consuming. Does anyone have any suggestions for speeding things up? I'm using a Windows 7 box with a dual-core AMD chip. I know R is not multi-threaded/multi-processor, but I was curious whether any of the parallel packages (rmpi, snow, snowfall, etc.) work with randomForest. Thanks.

I'm using rF for some classification work (0's and 1's). The data has about 8-12 variable columns and the training set is a sample of 10k rows, so it's a decent size but not crazy. I'm running 500 trees and an mtry of 2, 3, or 4.

EDIT 2: Here's some output:

> head(t22)
  Id Fail     CCUse Age S-TFail         DR MonInc #OpenLines L-TFail RE M-TFail Dep
1  1    1 0.7661266  45       2 0.80298213   9120         13       0  6       0   2
2  2    0 0.9571510  40       0 0.12187620   2600          4       0  0       0   1
3  3    0 0.6581801  38       1 0.08511338   3042          2       1  0       0   0
4  4    0 0.2338098  30       0 0.03604968   3300          5       0  0       0   0
5  5    0 0.9072394  49       1 0.02492570  63588          7       0  1       0   0
6  6    0 0.2131787  74       0 0.37560697   3500          3       0  1       0   1
> ptm <- proc.time()
>
> RF<- randomForest(t22[,-c(1,2,7,12)],t22$Fail
+                    ,sampsize=c(10000),do.trace=F,importance=TRUE,ntree=500,,forest=TRUE)
Warning message:
In randomForest.default(t22[, -c(1, 2, 7, 12)], t22$Fail, sampsize = c(10000),  :
  The response has five or fewer unique values.  Are you sure you want to do regression?
> proc.time() - ptm
   user  system elapsed
 437.30    0.86  450.97
>

Recommended answer

The manual of the foreach package has a section on parallel random forests (Using The foreach Package, Section 5.1):

> library("foreach")
> library("doSNOW")
> registerDoSNOW(makeCluster(4, type="SOCK"))

> x <- matrix(runif(500), 100)
> y <- gl(2, 50)

> rf <- foreach(ntree = rep(250, 4), .combine = combine, .packages = "randomForest") %dopar%
+    randomForest(x, y, ntree = ntree)
> rf
Call:
randomForest(x = x, y = y, ntree = ntree)
Type of random forest: classification
Number of trees: 1000
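
Note that this example reports "Type of random forest: classification" because y is a factor (created with gl). In the question's output, the response t22$Fail is numeric, which is why randomForest warns that it is running a regression. A minimal sketch of converting a 0/1 response to a factor is shown below; the data frame d and its values are made up for illustration and only borrow column names from the question. Switching from regression to classification may also change the run time, but that is not measured here.

```r
library(randomForest)

set.seed(42)

# Hypothetical stand-in for the question's data: a numeric 0/1 response
# plus two predictors. The values are invented for illustration.
d <- data.frame(
  Fail  = rep(c(0, 1), times = c(140, 60)),
  CCUse = runif(200),
  Age   = sample(20:80, 200, replace = TRUE)
)

# Passing a factor response makes randomForest fit a classification forest.
rf_cls <- randomForest(d[, c("CCUse", "Age")], as.factor(d$Fail), ntree = 50)

print(rf_cls$type)
```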

If we want to create a random forest model with 1000 trees, and our computer has four cores, we can split the problem into four pieces by executing the randomForest function four times, with the ntree argument set to 250. Of course, we have to combine the resulting randomForest objects, but the randomForest package comes with a function called combine.
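
The same split-and-combine idea can also be sketched with the parallel package that ships with base R, instead of foreach and doSNOW. This is an alternative, not the approach from the foreach manual. The worker count of 2 (matching the dual-core machine in the question), the seed, and the helper name grow_forest are assumptions chosen for illustration.

```r
library(randomForest)
library(parallel)

# Same toy data as in the foreach example above.
set.seed(1)
x <- matrix(runif(500), 100)
y <- gl(2, 50)

n_workers <- 2                      # assumption: one worker per core
cl <- makeCluster(n_workers)
clusterSetRNGStream(cl, 123)        # give each worker its own RNG stream

# Hypothetical helper: grow one sub-forest of n trees on a worker.
# randomForest:: is used so the workers do not need library() calls.
grow_forest <- function(n, x, y) {
  randomForest::randomForest(x, y, ntree = n)
}

# Each worker grows 250 trees; x and y are passed along as extra arguments.
forests <- parLapply(cl, rep(250, n_workers), grow_forest, x = x, y = y)
stopCluster(cl)

# Merge the sub-forests into a single randomForest object.
rf <- do.call(randomForest::combine, forests)
print(rf$ntree)
```

One caveat: according to the randomForest documentation, the combined object does not carry over the out-of-bag error components (err.rate and confusion are NULL), so accuracy should be checked on a held-out set.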
