Problem description
I was wondering how the final model (i.e. the decision boundary) of LogisticRegressionCV in sklearn is calculated. So say I have some Xdata and ylabels such that
Xdata # shape of this is (n_samples,n_features)
ylabels # shape of this is (n_samples,), and it is binary
Now I run
from sklearn.linear_model import LogisticRegressionCV
clf = LogisticRegressionCV(Cs=[1.0],cv=5)
clf.fit(Xdata,ylabels)
This looks at just one regularization parameter and 5 folds in the CV. So clf.scores_ will be a dictionary with one key, whose value is an array of shape (n_folds, 1). With these five folds you can get a better idea of how the model performs.
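As a quick illustration of that structure, here is a minimal sketch (the synthetic data via make_classification is my assumption, not part of the original question):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV

# Hypothetical stand-ins for the Xdata/ylabels in the question.
Xdata, ylabels = make_classification(n_samples=200, n_features=4, random_state=0)

clf = LogisticRegressionCV(Cs=[1.0], cv=5)
clf.fit(Xdata, ylabels)

# For a binary problem, scores_ is a dict with a single key (the class label 1),
# and its value has shape (n_folds, n_Cs) = (5, 1) here.
scores = clf.scores_[1]
print(scores.shape)
```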
However, I'm confused about what you get from clf.coef_ (and I'm assuming the parameters in clf.coef_ are the ones used in clf.predict). I think there are a few options:
- The parameters in clf.coef_ are from training the model on all the data
- The parameters in clf.coef_ are from the best scoring fold
- The parameters in clf.coef_ are averaged across the folds in some way
I imagine this is a duplicate question, but for the life of me I can't find a straightforward answer online, in the sklearn documentation, or in the source code for LogisticRegressionCV. Some relevant posts I found are:
- GridSearchCV final model
- scikit-learn LogisticRegressionCV: best coefficients
- Using cross validation and AUC-ROC for a logistic regression model in sklearn
- Evaluating Logistic regression with cross validation
Recommended answer
You are confusing hyper-parameters with parameters. All scikit-learn estimators whose names end in CV, like LogisticRegressionCV, GridSearchCV, or RandomizedSearchCV, tune hyper-parameters.
Hyper-parameters are not learnt from training on the data. They are set prior to learning, on the assumption that they will contribute to optimal learning.
In the case of LogisticRegression, C is a hyper-parameter that describes the inverse of the regularization strength: the higher the C, the less regularization is applied during training. It's not that C is changed during training; it stays fixed.
Now coming to coef_. coef_ contains the coefficients (also called weights) of the features, which are learnt (and updated) during training. Depending on the value of C (and the other hyper-parameters present in the constructor), the values they converge to can vary.
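To make that concrete, here is a hedged sketch (the synthetic data and the specific C values are my assumptions) showing that the learnt coef_ depend on the fixed C you choose:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=4, random_state=0)

weak_reg = LogisticRegression(C=100.0).fit(X, y)   # large C: little regularization
strong_reg = LogisticRegression(C=0.01).fit(X, y)  # small C: heavy regularization

# Stronger regularization (smaller C) shrinks the learnt weights toward zero,
# so the two fits end up with different coefficients.
print(np.abs(strong_reg.coef_).sum(), np.abs(weak_reg.coef_).sum())
```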
Now, there is another topic on how to get optimal initial values of coef_ so that training is faster and better; that's optimization. Some schemes start with random weights between 0 and 1, others start with 0, and so on. But for the scope of your question that is not relevant, and LogisticRegressionCV is not used for that.
This is what LogisticRegressionCV does:
- Get the different values of C from the constructor (in your example you passed 1.0).
- For each value of C, do cross-validation on the supplied data: a LogisticRegression is fit() on the training data of the current fold and scored on the test data. The scores from the test data of all folds are averaged, and that becomes the score of the current C. This is done for all C values you provided, and the C with the highest average score is chosen.
- The chosen C is set as the final C, and LogisticRegression is trained again (by calling fit()) on the whole data (Xdata, ylabels here).
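The steps above can be checked directly: because of the final refit on all the data, clf.coef_ should match (up to solver tolerance) a plain LogisticRegression trained on the whole dataset with the chosen C. A minimal sketch, again assuming synthetic data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV

Xdata, ylabels = make_classification(n_samples=200, n_features=4, random_state=0)

clf = LogisticRegressionCV(Cs=[1.0], cv=5).fit(Xdata, ylabels)
best_C = clf.C_[0]  # the chosen C (trivially 1.0 here, since only one was given)

# Refit a plain LogisticRegression on ALL the data with that C.
plain = LogisticRegression(C=best_C).fit(Xdata, ylabels)

# The coefficients should agree up to the solvers' convergence tolerance,
# confirming option 1: coef_ comes from training on all the data.
print(np.max(np.abs(clf.coef_ - plain.coef_)))
```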
That's what all the hyper-parameter tuners do, be it GridSearchCV, LogisticRegressionCV, or LassoCV, etc.
The initialization and updating of the coef_ feature weights is done inside the algorithm's fit() function, which is out of scope for hyper-parameter tuning. That optimization part depends on the internal optimization algorithm of the estimator, for example the solver param in the case of LogisticRegression.
Hope this makes things clear. Feel free to ask if you still have any doubts.