Problem description
I was wondering how the final model (i.e. the decision boundary) of LogisticRegressionCV in sklearn is calculated. So say I have some Xdata and ylabels such that
Xdata # shape of this is (n_samples,n_features)
ylabels # shape of this is (n_samples,), and it is binary
Now I run
from sklearn.linear_model import LogisticRegressionCV
clf = LogisticRegressionCV(Cs=[1.0],cv=5)
clf.fit(Xdata,ylabels)
This looks at just one regularization parameter and 5 folds in the CV. So clf.scores_ will be a dictionary with one key, whose value is an array of shape (n_folds, 1). With these five folds you can get a better idea of how the model performs.
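As a quick illustration of that structure, here is a minimal sketch (the synthetic data via make_classification is my assumption, not part of the original question):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV

# Hypothetical stand-ins for the Xdata/ylabels in the question.
Xdata, ylabels = make_classification(n_samples=200, n_features=4, random_state=0)

clf = LogisticRegressionCV(Cs=[1.0], cv=5)
clf.fit(Xdata, ylabels)

# For a binary problem, scores_ is a dict with a single key (the class label 1),
# and its value has shape (n_folds, n_Cs) = (5, 1) here.
scores = clf.scores_[1]
print(scores.shape)
```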
However, I'm confused about what you get from clf.coef_ (and I'm assuming the parameters in clf.coef_ are the ones used in clf.predict). I think there are a few options:
- The parameters in clf.coef_ are from training the model on all the data
- The parameters in clf.coef_ are from the best scoring fold
- The parameters in clf.coef_ are averaged across the folds in some way
I imagine this is a duplicate question, but for the life of me I can't find a straightforward answer online, in the sklearn documentation, or in the source code for LogisticRegressionCV. Some relevant posts I found are:
- GridSearchCV final model
- scikit-learn LogisticRegressionCV: best coefficients
- Using cross validation and AUC-ROC for a logistic regression model in sklearn
- Evaluating Logistic regression with cross validation
Recommended answer
You are confusing hyper-parameters with parameters. All scikit-learn estimators whose names end in CV, like LogisticRegressionCV, GridSearchCV, or RandomizedSearchCV, tune hyper-parameters.
Hyper-parameters are not learnt from training on the data. They are set prior to learning, on the assumption that they will contribute to optimal learning.
In the case of LogisticRegression, C is a hyper-parameter that describes the inverse of the regularization strength: the higher the C, the less regularization is applied during training. It's not that C is changed during training; it stays fixed.
Now coming to coef_. coef_ contains the coefficients (also called weights) of the features, which are learnt (and updated) during training. Depending on the value of C (and the other hyper-parameters present in the constructor), the values they converge to can vary.
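To make that concrete, here is a hedged sketch (the synthetic data and the specific C values are my assumptions) showing that the learnt coef_ depend on the fixed C you choose:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=4, random_state=0)

weak_reg = LogisticRegression(C=100.0).fit(X, y)   # large C: little regularization
strong_reg = LogisticRegression(C=0.01).fit(X, y)  # small C: heavy regularization

# Stronger regularization (smaller C) shrinks the learnt weights toward zero,
# so the two fits end up with different coefficients.
print(np.abs(strong_reg.coef_).sum(), np.abs(weak_reg.coef_).sum())
```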
Now, there is another topic on how to get optimal initial values of coef_ so that training is faster and better; that's optimization. Some schemes start with random weights between 0 and 1, others start with 0, and so on. But for the scope of your question that is not relevant, and LogisticRegressionCV is not used for that.
This is what LogisticRegressionCV does:
- Get the different values of C from the constructor (in your example you passed 1.0).
- For each value of C, do cross-validation on the supplied data: a LogisticRegression is fit() on the training data of the current fold and scored on the test data. The scores from the test data of all folds are averaged, and that becomes the score of the current C. This is done for all C values you provided, and the C with the highest average score is chosen.
- The chosen C is set as the final C, and LogisticRegression is trained again (by calling fit()) on the whole data (Xdata, ylabels here).
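The steps above can be checked directly: because of the final refit on all the data, clf.coef_ should match (up to solver tolerance) a plain LogisticRegression trained on the whole dataset with the chosen C. A minimal sketch, again assuming synthetic data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV

Xdata, ylabels = make_classification(n_samples=200, n_features=4, random_state=0)

clf = LogisticRegressionCV(Cs=[1.0], cv=5).fit(Xdata, ylabels)
best_C = clf.C_[0]  # the chosen C (trivially 1.0 here, since only one was given)

# Refit a plain LogisticRegression on ALL the data with that C.
plain = LogisticRegression(C=best_C).fit(Xdata, ylabels)

# The coefficients should agree up to the solvers' convergence tolerance,
# confirming option 1: coef_ comes from training on all the data.
print(np.max(np.abs(clf.coef_ - plain.coef_)))
```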
That's what all the hyper-parameter tuners do, be it GridSearchCV, LogisticRegressionCV, or LassoCV, etc.
The initialization and updating of the coef_ feature weights is done inside the algorithm's fit() function, which is out of scope for hyper-parameter tuning. That optimization part depends on the internal optimization algorithm of the estimator, for example the solver param in the case of LogisticRegression.
Hope this makes things clear. Feel free to ask if you still have any doubts.