Why is a random forest with a single tree much better than a decision tree classifier?

Question

I apply the decision tree classifier and the random forest classifier to my data with the following code:

from sklearn import tree
from sklearn.ensemble import RandomForestClassifier


def decision_tree(train_X, train_Y, test_X, test_Y):

    clf = tree.DecisionTreeClassifier()
    clf.fit(train_X, train_Y)

    return clf.score(test_X, test_Y)


def random_forest(train_X, train_Y, test_X, test_Y):
    clf = RandomForestClassifier(n_estimators=1)
    clf = clf.fit(train_X, train_Y)  # was clf.fit(X, Y): X, Y are undefined here

    return clf.score(test_X, test_Y)

Why are the results so much better for the random forest classifier (over 100 runs, each randomly sampling 2/3 of the data for training and 1/3 for testing)?

100%|███████████████████████████████████████| 100/100 [00:01<00:00, 73.59it/s]
Algorithm: Decision Tree
  Min     : 0.3883495145631068
  Max     : 0.6476190476190476
  Mean    : 0.4861783113770316
  Median  : 0.48868030937802126
  Stdev   : 0.047158171852401135
  Variance: 0.0022238931724605985
100%|███████████████████████████████████████| 100/100 [00:01<00:00, 85.38it/s]
Algorithm: Random Forest
  Min     : 0.6846846846846847
  Max     : 0.8653846153846154
  Mean    : 0.7894823428836184
  Median  : 0.7906101571063208
  Stdev   : 0.03231671150915106
  Variance: 0.0010443698427656967

Isn't a random forest with one estimator just a decision tree? Have I done something wrong, or have I misunderstood the concept?

Answer

Well, this is a good question, and the answer turns out to be no; the Random Forest algorithm is more than a simple bag of individually-grown decision trees.

Apart from the randomness induced by ensembling many trees, the Random Forest (RF) algorithm also incorporates randomness when building individual trees, in two distinct ways, neither of which is present in the simple Decision Tree (DT) algorithm.

The first is the number of features to consider when looking for the best split at each tree node: while DT considers all the features, RF considers only a random subset of them, of size equal to the parameter max_features (see the docs).
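As an illustrative sketch of this first point (on synthetic data made up here, not the asker's), a single-tree forest left at its default max_features only sees a random subset of features at each split, so it can choose a different root split than a plain decision tree fit on the very same data:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
y = (X[:, 0] + X[:, 3] > 0).astype(int)  # label depends only on features 0 and 3

# Plain DT: every split considers all 8 features.
dt = DecisionTreeClassifier(random_state=0).fit(X, y)

# Single-tree RF with bootstrap disabled, so the ONLY difference left is
# the default max_features="sqrt" feature subsampling at each split.
rf = RandomForestClassifier(n_estimators=1, bootstrap=False,
                            random_state=0).fit(X, y)

# Compare which feature each model splits on at the root.
print(dt.tree_.feature[0], rf.estimators_[0].tree_.feature[0])
```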

The second is that, while DT considers the whole training set, a single RF tree considers only a bootstrapped sub-sample of it; from the docs again:

The sub-sample size is always the same as the original input sample size but the samples are drawn with replacement if bootstrap=True (default).
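That sampling-with-replacement step can be sketched in a few lines of numpy (the names here are illustrative, not part of scikit-learn's API): some rows appear several times in the bootstrap sample while others are left out entirely, which is exactly why each RF tree effectively trains on a different dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 10

# Draw n_samples row indices *with replacement*: same size as the
# original set, but some rows repeat and others are missing
# (on average about 1 - 1/e ≈ 63% distinct rows).
boot_idx = rng.integers(0, n_samples, size=n_samples)

print(sorted(boot_idx))
print(len(np.unique(boot_idx)), "distinct rows out of", n_samples)
```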

The RF algorithm is essentially the combination of two independent ideas: bagging, and random selection of features (see the Wikipedia entry for a nice overview). Bagging is essentially my second point above, but applied to an ensemble; random selection of features is my first point above, and it seems that it had been independently proposed by Tin Kam Ho before Breiman's RF (again, see the Wikipedia entry). Ho had already suggested that random feature selection alone improves performance. This is not exactly what you have done here (you still use the bootstrap sampling idea from bagging, too), but you could easily replicate Ho's idea by setting bootstrap=False in your RandomForestClassifier() arguments. The fact is that, given this research, the difference in performance is not unexpected...
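Ho's feature-subsampling-only variant could be sketched like this (the dataset and split below are stand-ins for illustration, not the asker's data):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, Y = load_iris(return_X_y=True)
train_X, test_X, train_Y, test_Y = train_test_split(
    X, Y, test_size=1/3, random_state=0)

# bootstrap=False: every tree sees the full training set, so the only
# remaining randomness is the per-split feature subsampling (max_features).
clf = RandomForestClassifier(n_estimators=100, bootstrap=False, random_state=0)
clf.fit(train_X, train_Y)
print(clf.score(test_X, test_Y))
```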

To replicate exactly the behaviour of a single tree in RandomForestClassifier(), you should use both the bootstrap=False and max_features=None arguments, i.e.

clf = RandomForestClassifier(n_estimators=1, max_features=None, bootstrap=False)

in which case neither bootstrap sampling nor random feature selection will take place, and the performance should be roughly equal to that of a single decision tree.
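A quick sanity check of that equivalence might look like this (again on a stand-in dataset rather than the asker's; with ties between candidate splits the two trees can still differ slightly, hence "roughly equal"):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, Y = load_iris(return_X_y=True)
train_X, test_X, train_Y, test_Y = train_test_split(
    X, Y, test_size=1/3, random_state=0)

dt = DecisionTreeClassifier(random_state=0).fit(train_X, train_Y)

# No bootstrap, all features at every split: the single "forest" tree is
# grown from the same data and the same candidate splits as the plain tree.
rf = RandomForestClassifier(n_estimators=1, max_features=None,
                            bootstrap=False, random_state=0).fit(train_X, train_Y)

print(dt.score(test_X, test_Y), rf.score(test_X, test_Y))
```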
