Difference in prediction results from a random forest model


Question

I have built a random forest model, and I got two different sets of predictions when I wrote two different lines of code to generate them. I wonder which one is right. Here is my example data frame and the code I used:

dat <- read.table(text = " cats birds    wolfs     snakes
      0        3        9         7
      1        3        8         4
      1        1        2         8
      0        1        2         3
      0        1        8         3
      1        6        1         2
      0        6        7         1
      1        6        1         5
      0        5        9         7
      1        3        8         7
      1        4        2         7
      0        1        2         3
      0        7        6         3
      1        6        1         1
      0        6        3         9
      1        6        1         1   ",header = TRUE)

I've built a random forest model:

library(randomForest)

model <- randomForest(snakes ~ cats + birds + wolfs, data = dat, ntree = 20)
RF_pred <- data.frame(predict(model))
dat <- cbind(dat, RF_pred) # this gave me a prediction column named "predict.model."

Out of curiosity, I tried another syntax with this line of code:

dat$RF_pred <- predict(model, newdata = dat, type = "response") # this gave me a prediction column named "RF_pred"

To my surprise, I got different predictions:

 dat
   cats birds wolfs snakes predict.model.  RF_pred
1     0     3     9      7       3.513889 5.400675
2     1     3     8      4       5.570000 5.295417
3     1     1     2      8       3.928571 5.092917
4     0     1     2      3       4.925893 4.208452
5     0     1     8      3       4.583333 4.014008
6     1     6     1      2       3.766667 2.943750
7     0     6     7      1       5.486806 4.061508
8     1     6     1      5       3.098148 2.943750
9     0     5     9      7       4.575397 5.675675
10    1     3     8      7       4.729167 5.295417
11    1     4     2      7       4.416667 5.567917
12    0     1     2      3       4.222619 4.208452
13    0     7     6      3       6.125714 4.036508
14    1     6     1      1       3.695833 2.943750
15    0     6     3      9       4.115079 5.178175
16    1     6     1      1       3.595238 2.943750

Why is there a difference between the two? Which one is correct? Any ideas?

Recommended answer

The difference lies in the two calls to predict:

predict(model)

and

predict(model, newdata=dat)

The first call returns the random forest's out-of-bag (OOB) predictions on your training data: each row is predicted using only the trees whose bootstrap sample did not include that row. This is generally what you want when comparing predicted values to actuals.

The second call treats your training data as if it were a new dataset and runs every observation down every tree, including the trees that were trained on that observation. This produces an artificially close correlation between the predictions and the actuals, since the RF algorithm generally doesn't prune the individual trees and relies instead on the ensemble of trees to control overfitting. So don't do this if you want predictions on the training data.
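To see the two behaviours side by side, here is a minimal sketch using the data from the question. The seed, the larger `ntree = 500`, and the names `oob_pred`, `insample_pred`, `oob_mse` and `insample_mse` are my own choices for illustration, not part of the original post. It checks that `predict(model)` is just the OOB predictions stored on the fitted object, then compares the error of each set of predictions against the actuals.

```r
library(randomForest)

# Same data as in the question, written out column by column.
dat <- data.frame(
  cats   = c(0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1),
  birds  = c(3, 3, 1, 1, 1, 6, 6, 6, 5, 3, 4, 1, 7, 6, 6, 6),
  wolfs  = c(9, 8, 2, 2, 8, 1, 7, 1, 9, 8, 2, 2, 6, 1, 3, 1),
  snakes = c(7, 4, 8, 3, 3, 2, 1, 5, 7, 7, 7, 3, 3, 1, 9, 1)
)

set.seed(42)  # arbitrary seed, only so the run is repeatable
# ntree = 500 (an assumption, larger than in the question) so that every row
# is out of bag for at least some trees.
model <- randomForest(snakes ~ cats + birds + wolfs, data = dat, ntree = 500)

# Without newdata: out-of-bag predictions for the training rows.
oob_pred <- predict(model)

# With newdata = dat: every tree predicts every row, including rows it was fit on.
insample_pred <- predict(model, newdata = dat)

# The OOB predictions are the ones stored on the fitted object.
same_as_stored <- isTRUE(all.equal(unname(oob_pred), unname(model$predicted)))

# Mean squared error against the actuals for each set of predictions.
oob_mse      <- mean((dat$snakes - oob_pred)^2)
insample_mse <- mean((dat$snakes - insample_pred)^2)

print(same_as_stored)
print(c(oob = oob_mse, in_sample = insample_mse))
```

The in-sample error should come out noticeably lower than the OOB error, which is the optimistic bias described above. For an honest error estimate on the training data use the OOB predictions, and keep `newdata` for rows the model has not seen.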
