This article covers how to ensure the correct order of operations for random forest classification in scikit-learn, based on a question and its recommended answer.

Problem Description

I would like to ensure that the order of operations for my machine learning workflow is right:

from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import GridSearchCV, KFold

# 1. Initialize model
model = RandomForestClassifier(5000)

# 2. Load dataset
iris = datasets.load_iris()
X, y = iris.data, iris.target

# 3. Remove unimportant features
model = SelectFromModel(model, threshold=0.5).estimator

# 4. Cross-validate the model on the important features
k_fold = KFold(n_splits=10, shuffle=True)
for k, (train, test) in enumerate(k_fold.split(X)):
    model.fit(X[train], y[train])

# 5. Grid search for the best parameters
param_grid = {
    'n_estimators': [1000, 2500, 5000],
    'max_features': ['auto', 'sqrt', 'log2'],
    'max_depth': [3, 5, X.shape[1]]
}

gs = GridSearchCV(estimator=model, param_grid=param_grid)
gs.fit(X, y)
model = gs.best_estimator_

# Now the model can be used for prediction

Please let me know if this order looks good or if something can be done to improve it.

--EDIT, clarifying to reduce downvotes.

Specifically:

  1. Should the SelectFromModel step be done after cross-validation?
  2. Should the grid search be done before cross-validation?

Recommended Answer

The main problem with your approach is that you are confusing the feature selection transformer with the final estimator. What you need to do is create two stages, with the transformer first:

# Stage 1: a random forest used only to rank feature importances
rf_feature_imp = RandomForestClassifier(100)
feat_selection = SelectFromModel(rf_feature_imp, threshold=0.5)

Then you need a second stage in which you train a classifier on the reduced feature set.

# Stage 2: the final classifier, trained on the selected features only
clf = RandomForestClassifier(5000)

Once you have both stages, you can build a pipeline that combines the two into the final model.

from sklearn.pipeline import Pipeline

model = Pipeline([
          ('fs', feat_selection),
          ('clf', clf),
        ])
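
Built this way, the feature selection is re-fit inside every training fold of any cross-validation run on the pipeline, rather than once on the full dataset, which speaks directly to the first clarifying question above. As a minimal illustrative sketch (assuming the iris X and y from the question are in scope), the whole pipeline can be cross-validated as a single estimator; the sketch uses smaller forests and a looser "mean" threshold, since a 0.5 cutoff can end up selecting none of iris's four features:

from sklearn.model_selection import cross_val_score

# Hypothetical demo pipeline: smaller forests and a "mean" importance
# threshold so the sketch runs quickly on the iris data.
demo_model = Pipeline([
    ('fs', SelectFromModel(RandomForestClassifier(n_estimators=100), threshold='mean')),
    ('clf', RandomForestClassifier(n_estimators=100)),
])

# SelectFromModel is re-fit on each training fold, so the held-out fold
# never influences which features are kept.
scores = cross_val_score(demo_model, X, y, cv=5)
print(scores.mean())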

Now you can perform a GridSearch on your model. Keep in mind that there are two stages, so parameters must be prefixed with the stage name, fs or clf. Within the feature selection stage, you can also reach the base estimator using fs__estimator. Below is an example of how to search parameters on any of the three objects.

params = {
    'fs__threshold': [0.5, 0.3, 0.7],
    'fs__estimator__max_features': ['auto', 'sqrt', 'log2'],
    'clf__max_features': ['auto', 'sqrt', 'log2'],
}

gs = GridSearchCV(model, params, ...)
gs.fit(X, y)

You can then make predictions with gs directly, or with gs.best_estimator_.
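
For example, a minimal sketch assuming gs has been fitted as above, with X standing in for whatever samples you want to score:

# With the default refit=True, GridSearchCV keeps the best pipeline refit on
# all of the data, so it can be used directly as a model.
print(gs.best_params_)

preds = gs.predict(X)                      # delegates to the best pipeline
preds_alt = gs.best_estimator_.predict(X)  # equivalent, via the refit pipeline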
