What is out-of-bag error in Random Forests?

Question

What is out-of-bag error in Random Forests? Is it the optimal parameter for finding the right number of trees in a Random Forest?

Recommended answer

I will attempt to explain:

Suppose our training data set is represented by T, and suppose the data set has M features (or attributes or variables).

T = {(X1,y1), (X2,y2), ... (Xn, yn)}

and

Xi is the input vector {xi1, xi2, ... xiM}

yi is the label (or output or class).

Summary of RF:

The Random Forests algorithm is a classifier based primarily on two methods:

Bagging
Random subspace method

Suppose we decide to have S trees in our forest. We first create S datasets of the "same size as the original", each built by randomly resampling the data in T with replacement (n draws for each dataset). This results in the datasets {T1, T2, ... TS}. Each of these is called a bootstrap dataset. Because the sampling is "with replacement", every dataset Ti can contain duplicate data records, and Ti can be missing several data records from the original dataset. This is called Bootstrapping. (en.wikipedia.org/wiki/Bootstrapping_(statistics))
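To make the resampling step concrete, here is a minimal sketch using only the Python standard library. The function names (bootstrap_indices, make_bootstraps) and the choice to store record indices rather than the records themselves are my own assumptions for illustration. Indices are 0-based in the code, while the text counts T1, T2, ... from 1.

```python
import random

def bootstrap_indices(n, rng):
    """Draw n record indices from range(n), uniformly and with replacement."""
    return [rng.randrange(n) for _ in range(n)]

def make_bootstraps(n, S, seed=0):
    """Return S bootstrap datasets, each stored as a list of n record indices."""
    rng = random.Random(seed)  # seeded only so the sketch is repeatable
    return [bootstrap_indices(n, rng) for _ in range(S)]

n, S = 10, 3
bootstraps = make_bootstraps(n, S)
for k, idx in enumerate(bootstraps, start=1):
    # Records that were never drawn are the ones "missing" from Tk.
    left_out = sorted(set(range(n)) - set(idx))
    print(f"T{k}: indices={idx} left out={left_out}")
```

A given record is left out of one bootstrap dataset with probability (1 - 1/n)^n, which approaches 1/e (about 0.368) as n grows, so roughly a third of the records are typically missing from each Ti.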

Bagging is the process of taking bootstraps and then aggregating the models learned on each bootstrap.

Now, RF creates S trees and uses m (= sqrt(M) or = floor(ln M + 1)) random subfeatures out of the M possible features to create any tree. This is called the random subspace method.
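The feature selection can be sketched as follows. This is only an illustration of the description above: the helper names (subspace_size, random_subspace) are hypothetical, and rounding sqrt(M) down to an integer is my assumption, since the text does not say how to round.

```python
import math
import random

def subspace_size(M, rule="sqrt"):
    """Number of features m to use, following the two rules of thumb above."""
    if rule == "sqrt":
        return max(1, math.isqrt(M))                # floor(sqrt(M)), assumed rounding
    if rule == "log":
        return max(1, math.floor(math.log(M) + 1))  # floor(ln M + 1)
    raise ValueError(f"unknown rule: {rule!r}")

def random_subspace(M, m, rng):
    """Pick m distinct feature indices out of range(M), without replacement."""
    return sorted(rng.sample(range(M), m))

rng = random.Random(0)
M = 16
m = subspace_size(M)
features_per_tree = [random_subspace(M, m, rng) for _ in range(3)]
for k, feats in enumerate(features_per_tree, start=1):
    print(f"tree K{k} uses features {feats}")
```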

So for each bootstrap dataset Ti you create a tree Ki. If you want to classify some input data D = {x1, x2, ..., xM}, you let it pass through each tree and produce S outputs (one for each tree), which can be denoted by Y = {y1, y2, ..., yS}. The final prediction is a majority vote on this set.
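The voting step is small enough to show directly. A minimal sketch, assuming the S tree outputs are already collected in a list; the name majority_vote is mine, and how ties are broken is not specified in the text (here the label that appears first among the tied ones wins, which is simply how Counter.most_common orders equal counts).

```python
from collections import Counter

def majority_vote(votes):
    """Return the most frequent label among the trees' outputs."""
    if not votes:
        raise ValueError("need at least one vote")
    return Counter(votes).most_common(1)[0][0]

# Y = {y1, ..., yS}: hypothetical outputs from S = 5 trees for one input D.
Y = ["cat", "dog", "cat", "cat", "dog"]
prediction = majority_vote(Y)
print(prediction)
```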

Out-of-bag error:

After creating the classifiers (S trees), for each (Xi, yi) in the original training set T, select all Tk that do not include (Xi, yi). Note that this subset is a set of bootstrap datasets, none of which contains that particular record from the original dataset. This set is called the out-of-bag examples. There are n such subsets (one for each data record in the original dataset T). The OOB classifier is the aggregation of votes ONLY over the trees built on those Tk that do not contain (Xi, yi).
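Finding, for every record, the bootstrap datasets that leave it out might look like the sketch below. It assumes each Tk is stored as a list of 0-based record indices; the function name out_of_bag_sets and the tiny hand-written bootstraps are made up for illustration.

```python
def out_of_bag_sets(bootstraps, n):
    """For each record i, list the positions k of bootstraps that do not contain i."""
    members = [set(idx) for idx in bootstraps]  # membership test per bootstrap
    return [[k for k, s in enumerate(members) if i not in s] for i in range(n)]

# Three hand-made bootstrap datasets over n = 4 records (0-based indices).
bootstraps = [
    [0, 0, 1, 3],  # record 2 is missing
    [1, 2, 2, 3],  # record 0 is missing
    [0, 1, 1, 2],  # record 3 is missing
]
oob = out_of_bag_sets(bootstraps, n=4)
print(oob)  # [[1], [], [0], [2]]
```

Record 1 appears in every bootstrap here, so its list is empty and no tree can cast an out-of-bag vote for it. With many trees this becomes very unlikely, but code that computes the OOB estimate has to allow for it.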

The out-of-bag estimate of the generalization error is the error rate of the out-of-bag classifier on the training set (compare its predictions with the known yi's).
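Putting the pieces together, the estimate can be computed without knowing anything about how the trees were grown. The sketch below assumes the trees' predictions on the training records are already available as predictions[k][i] (tree k's label for record i); that layout, the name oob_error, and the choice to skip records that have no out-of-bag tree are all my assumptions, and the numbers are toy values.

```python
from collections import Counter

def oob_error(y, predictions, bootstraps):
    """Error rate of the OOB classifier on the training set.

    y[i]              known label of record i
    predictions[k][i] label that tree k assigns to record i
    bootstraps[k]     record indices used to train tree k
    """
    members = [set(idx) for idx in bootstraps]
    wrong = scored = 0
    for i, truth in enumerate(y):
        # Only trees whose bootstrap did NOT contain record i may vote.
        votes = [predictions[k][i] for k, s in enumerate(members) if i not in s]
        if not votes:
            continue  # record i is in every bootstrap, so it cannot be scored
        scored += 1
        if Counter(votes).most_common(1)[0][0] != truth:
            wrong += 1
    return wrong / scored if scored else float("nan")

y = [0, 1, 1, 0]
bootstraps = [[0, 0, 1, 3], [1, 2, 2, 3], [0, 1, 1, 2]]
predictions = [
    [0, 1, 1, 0],  # tree 0
    [0, 1, 1, 0],  # tree 1
    [0, 1, 1, 1],  # tree 2 gets record 3 wrong
]
err = oob_error(y, predictions, bootstraps)
print(f"OOB error = {err:.3f}")
```

In this toy case three records can be scored (record 1 has no out-of-bag tree) and one of them, record 3, is misclassified by its only out-of-bag tree, so the estimate is 1/3.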

Why is it important?

(Thanks to @Rudolf for the corrections. His comments are below.)
