Question
What is out-of-bag error in Random Forests? Is it the optimal parameter for finding the right number of trees in a Random Forest?
Recommended answer
I will attempt to explain:
Suppose our training dataset is represented by T, and suppose the dataset has M features (or attributes or variables).
T = {(X1,y1), (X2,y2), ... (Xn, yn)}
and
Xi is input vector {xi1, xi2, ... xiM}
yi is the label (or output or class).
Summary of RF:
The Random Forests algorithm is a classifier based primarily on two methods:
- Bagging
- Random subspace method.
Suppose we decide to have S trees in our forest. We first create S datasets, each the same size as the original, by randomly resampling the data in T with replacement (n draws for each dataset). This results in the datasets {T1, T2, ... TS}. Each of these is called a bootstrap dataset. Because sampling is with replacement, every dataset Ti can contain duplicate data records and can be missing several data records from the original dataset. This is called bootstrapping (en.wikipedia.org/wiki/Bootstrapping_(statistics)).
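To make the resampling step concrete, here is a minimal sketch of drawing one bootstrap dataset by index. The function name `bootstrap_indices`, the seed, and n = 1000 are illustrative assumptions, not part of the original answer. It also shows that roughly a third of the records are left out of any single bootstrap, since each record is missed with probability (1 - 1/n)^n, which approaches 1/e.

```python
import random


def bootstrap_indices(n, rng):
    """Draw n indices from range(n) with replacement (one bootstrap dataset)."""
    return [rng.randrange(n) for _ in range(n)]


rng = random.Random(0)  # fixed seed, an arbitrary choice for repeatability
n = 1000                # hypothetical training set size

in_bag = bootstrap_indices(n, rng)
out_of_bag = set(range(n)) - set(in_bag)  # records this bootstrap never drew

# Observed share of records missing from this bootstrap dataset
oob_fraction = len(out_of_bag) / n
# Theoretical probability that a given record is missed in n draws
expected = (1 - 1 / n) ** n

print(f"observed OOB fraction: {oob_fraction:.3f}, expected: {expected:.3f}")
```

The records in `out_of_bag` are exactly the ones that the tree grown on this bootstrap has never seen.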
Bagging is the process of taking bootstraps and then aggregating the models learned on each bootstrap.
Now, RF creates S trees and uses m (=sqrt(M) or =floor(lnM+1)) random subfeatures out of the M possible features to create any tree. This is called the random subspace method. (In the standard Random Forests algorithm, the m features are redrawn at random at each split of a tree rather than once per tree.)
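As a rough illustration of the feature sampling described above, the sketch below computes m under both rules quoted in the answer and draws one random subset of feature indices. The helper names `feature_subset_size` and `random_feature_subset` are hypothetical, and M = 16 is an arbitrary example value.

```python
import math
import random


def feature_subset_size(M, rule="sqrt"):
    """Return m, the number of features to sample out of M."""
    if rule == "sqrt":
        return max(1, math.isqrt(M))                # floor(sqrt(M))
    if rule == "log":
        return max(1, math.floor(math.log(M) + 1))  # floor(ln M + 1)
    raise ValueError(f"unknown rule: {rule}")


def random_feature_subset(M, m, rng):
    """Pick m distinct feature indices out of range(M), without replacement."""
    return sorted(rng.sample(range(M), m))


M = 16                                    # hypothetical number of features
m = feature_subset_size(M)                # 4 under the sqrt rule
features = random_feature_subset(M, m, random.Random(0))
print(m, features)
```

The same draw applies whether the subset is chosen once per tree or again at every split; only the place where `random_feature_subset` is called changes.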
So for each bootstrap dataset Ti you create a tree Ki. If you want to classify some input data D = {x1, x2, ..., xM}, you pass it through each tree and obtain S outputs (one per tree), which can be denoted by Y = {y1, y2, ..., ys}. The final prediction is a majority vote over this set.
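The voting step can be sketched in a few lines. The function name `majority_vote` and the example labels are assumptions made for illustration; the original answer does not say how ties are broken, so this sketch simply returns the tied label that was seen first.

```python
from collections import Counter


def majority_vote(votes):
    """Return the label that occurs most often among the per-tree outputs."""
    # most_common(1) yields [(label, count)]; among tied labels,
    # the one encountered first in `votes` is returned.
    return Counter(votes).most_common(1)[0][0]


Y = ["spam", "ham", "spam", "spam", "ham"]  # hypothetical outputs of S = 5 trees
prediction = majority_vote(Y)
print(prediction)
```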
Out-of-bag error:
After creating the classifiers (S trees), for each (Xi,yi) in the original training set T, select all Tk that do not include (Xi,yi). Note that this subset is a set of bootstrap datasets, none of which contains that particular record from the original dataset. This set is called the out-of-bag examples. There are n such subsets (one for each data record in the original dataset T). The OOB classifier is the aggregation of votes ONLY over the trees whose Tk does not contain (xi,yi).
The out-of-bag estimate of the generalization error is the error rate of the out-of-bag classifier on the training set (compare its predictions with the known yi's).
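Putting the pieces together, the OOB estimate might be computed roughly as follows. This is a sketch under stated assumptions: `oob_error`, `trees`, and `bags` are hypothetical names, each "tree" is a hand-written stand-in rule rather than a trained decision tree, and the bag contents are made up so the arithmetic can be followed by hand.

```python
from collections import Counter


def oob_error(trees, bags, X, y):
    """Estimate generalization error from out-of-bag votes.

    trees[k] is a callable x -> label standing in for tree Kk;
    bags[k] is the set of training indices present in bootstrap Tk.
    """
    errors = 0
    counted = 0
    for i, (x_i, y_i) in enumerate(zip(X, y)):
        # Only trees whose bootstrap dataset did NOT contain record i may vote
        votes = [tree(x_i) for tree, bag in zip(trees, bags) if i not in bag]
        if not votes:
            continue  # record i was in every bag, so it has no OOB prediction
        counted += 1
        if Counter(votes).most_common(1)[0][0] != y_i:
            errors += 1
    return errors / counted if counted else float("nan")


# Toy data: one numeric feature, binary label
X = [0, 1, 2, 3]
y = [0, 0, 1, 1]

# Stand-in "trees" (hand-written rules) and made-up bag contents
trees = [lambda x: int(x >= 2), lambda x: int(x >= 1), lambda x: 0]
bags = [{0, 2, 3}, {1, 2}, {0, 1, 3}]

error = oob_error(trees, bags, X, y)
print(error)
```

In this toy setup each record is out-of-bag for exactly one tree, and only record 2 is misclassified by its out-of-bag voter, so one of the four records counts as an error.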
Why is it important?
(Thanks to @Rudolf for the corrections. His comments are below.)