This article walks through a question about parameters inside other parameters: how to use bootstrap aggregation with a random forest in ensemble learning. It may be a useful reference for anyone running into the same problem.

Problem Description

Let’s say I decide to use an ensemble method - if it makes a difference, we’ll use the iris dataset. Of the available ensemble techniques, we’ll focus on the parallel methods, and from those we’ll take bootstrap aggregation, using sklearn.

Sklearn implements bootstrap aggregation by using BaggingClassifier, which (the documentation tells us) is "an ensemble meta-estimator that fits base classifiers…" Of those base classifiers, let’s select RandomForestClassifier, which itself "is a meta estimator that fits a number of decision tree classifiers".

Bootstrap aggregation, we’re told, comes essentially in four flavors: bagging, pasting, random subspaces and random patches. In BaggingClassifier, we activate each of these four flavors by manipulating 4 of the 11 parameters of BaggingClassifier, namely: bootstrap_features (True/False), bootstrap (True/False), max_features (=1/<1), and max_samples (=1/<1).
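To make that mapping concrete, here is a minimal sketch of the four configurations (the 0.6/0.5 fractions are arbitrary illustrative values, and a plain DecisionTreeClassifier stands in as the base estimator for brevity):

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

base = DecisionTreeClassifier()

# bagging: draw samples with replacement, keep all features
bagging = BaggingClassifier(base, bootstrap=True, max_samples=0.6,
                            bootstrap_features=False, max_features=1.0)

# pasting: draw samples without replacement, keep all features
pasting = BaggingClassifier(base, bootstrap=False, max_samples=0.6,
                            bootstrap_features=False, max_features=1.0)

# random subspaces: keep all samples, draw feature subsets
subspaces = BaggingClassifier(base, bootstrap=False, max_samples=1.0,
                              bootstrap_features=True, max_features=0.5)

# random patches: draw subsets of both samples and features
patches = BaggingClassifier(base, bootstrap=True, max_samples=0.6,
                            bootstrap_features=True, max_features=0.5)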

In sklearn, to use BaggingClassifier with RandomForestClassifier we need to:

clf = BaggingClassifier(RandomForestClassifier(parameters), parameters)
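A minimal runnable version of that one-liner, using the iris dataset mentioned above (the n_estimators values here are arbitrary, chosen just to keep the example small):

from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier

X, y = load_iris(return_X_y=True)

# outer ensemble of 5 random forests, each forest holding 10 trees
clf = BaggingClassifier(RandomForestClassifier(n_estimators=10), n_estimators=5)
clf.fit(X, y)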

Turns out that among RandomForestClassifier’s 17 parameters, two share names with BaggingClassifier’s: bootstrap and max_features. While bootstrap means the same thing for both BaggingClassifier and RandomForestClassifier (i.e., sampling with/without replacement), I’m not sure about max_features. In BaggingClassifier, max_features is "the number of features to draw from X to train" the base estimator, in this case RandomForestClassifier, while in RandomForestClassifier it’s "the number of features to consider when looking for the best split".
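One way to see what max_features means for the outer estimator: after fitting, BaggingClassifier exposes an estimators_features_ attribute listing which columns each base estimator was trained on (a small sketch; the exact columns drawn will vary with the random state):

from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)          # 150 samples, 4 features

clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=3,
                        max_features=0.5, bootstrap_features=False,
                        random_state=0).fit(X, y)

# each base estimator saw its own subset of 2 of the 4 columns,
# e.g. [array([0, 2]), array([1, 3]), array([0, 3])]
print(clf.estimators_features_)

By contrast, RandomForestClassifier’s max_features never removes columns up front; it is re-applied at every split inside every tree.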

And this, finally, brings me to the question: how do we coordinate these parameters in these two classifiers so we can get the four flavors of bootstrap aggregation in each of the trees in the random forest? I’m not just asking if something like this works as the pasting flavor:

clf = BaggingClassifier(RandomForestClassifier(bootstrap=False, max_features=1.0),
                        bootstrap_features=False, bootstrap=False,
                        max_features=1.0, max_samples=0.6)
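Whether the outer draw really behaves like pasting can be checked empirically: in recent sklearn versions the fitted BaggingClassifier exposes an estimators_samples_ attribute holding the row indices drawn for each base estimator. A hedged sketch on the iris data, with n_estimators=3 added just to keep the run small; with bootstrap=False every index should be distinct:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier

X, y = load_iris(return_X_y=True)          # 150 samples

clf = BaggingClassifier(RandomForestClassifier(bootstrap=False, max_features=1.0),
                        n_estimators=3, bootstrap_features=False, bootstrap=False,
                        max_features=1.0, max_samples=0.6).fit(X, y)

for idx in clf.estimators_samples_:
    # 0.6 * 150 = 90 rows per estimator, all distinct (no replacement)
    print(len(idx), len(np.unique(idx)))    # -> 90 90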

I’m really trying to understand what’s actually going on behind the scenes when BaggingClassifier calls RandomForestClassifier with all these parameters tuned to different values.

Recommended Answer

There is no conflict between the Random Forest parameters and the Ensemble Classifier parameters. The reason Random Forest has similar parameters (btw, max_features is the same in both; it is just phrased differently) is that Random Forest is itself an Ensemble algorithm.

Hence, what you are trying to achieve here is an Ensemble of Ensemble classifiers, where each has its own parameters. If I slightly change your example to make it easier to understand, we have:

BaggingClassifier(RandomForestClassifier(n_estimators=100, bootstrap=True, max_features=0.5),
                  n_estimators=5, bootstrap_features=False, bootstrap=False,
                  max_features=1.0, max_samples=0.6)

Here is how it works:

  • First, the EnsembleClassifier takes all the features (given by bootstrap_features = False, max_features = 1.0) and draws 60% of your sample (max_samples = 0.6) without replacement (bootstrap = False)
  • Then it feeds all the features and that 60% of the sample to a RandomForest
    • The Random Forest selects, without replacement, 50% of the features (max_features = 0.5) passed by the Ensemble at the previous step (which, in our case, are all the features) and does a bootstrap sampling (with replacement) of the 60% sample passed by the Ensemble Classifier. Based on this, it trains a Decision Tree and repeats the procedure n_estimators = 100 times, each time with new features and a new bootstrap sample

  • This is repeated n_estimators = 5 times by the Ensemble Classifier (a numeric sketch of these sizes follows after this list)
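Putting numbers on those steps, assuming the iris data from the question (150 samples, 4 features), a comment-only sketch:

# Outer BaggingClassifier, once per base estimator (5 times in total):
#   features kept:  4 * 1.0 = 4    (max_features = 1.0, bootstrap_features = False)
#   samples drawn:  150 * 0.6 = 90, without replacement (bootstrap = False)
#
# Inner RandomForestClassifier, once per tree (100 times per forest):
#   bootstrap sample: 90 rows drawn with replacement from the 90 it received
#   features per split: 4 * 0.5 = 2 considered at each split (max_features = 0.5)
#
# Grand total: 5 forests * 100 trees = 500 decision trees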

Hope this helps!

TLDR: the parameters you pass to RandomForestClassifier and to the EnsembleClassifier may have the same names and actually do the same thing, but they do it at different stages of the training process, and if you set bootstrap = False in one, that parameter value is not passed on to the other.
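That last point is easy to verify from the nested parameter names. In recent sklearn versions the inner estimator's parameters are exposed under an estimator__ prefix (older versions used base_estimator__), and setting bootstrap on the outer estimator leaves the inner one untouched:

from sklearn.ensemble import BaggingClassifier, RandomForestClassifier

clf = BaggingClassifier(RandomForestClassifier(), bootstrap=False)

params = clf.get_params()
print(params["bootstrap"])              # False -- the outer draw
print(params["estimator__bootstrap"])   # True  -- the forest's own default, unchanged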
