This article looks at why Scikit-learn can return coefficient of determination (R^2) values smaller than -1 and what to do about it; the question and answer below may be a useful reference for anyone facing the same problem.

Problem Description

I'm doing a simple linear model. I have

from sklearn import linear_model
from sklearn.model_selection import cross_val_score  # sklearn.cross_validation in older releases

fire = load_data()  # the asker's own loader for their "fire" dataset
regr = linear_model.LinearRegression()
scores = cross_val_score(regr, fire.data, fire.target, cv=10, scoring='r2')
print(scores)

which produces:

[  0.00000000e+00   0.00000000e+00  -8.27299054e+02  -5.80431382e+00
  -1.04444147e-01  -1.19367785e+00  -1.24843536e+00  -3.39950443e-01
   1.95018287e-02  -9.73940970e-02]

How is this possible? When I do the same thing with the built-in diabetes data it works perfectly fine, but for my data it returns these seemingly absurd results. Have I done something wrong?

Recommended Answer

There is no reason r^2 shouldn't be negative (despite the ^2 in its name). This is also stated in the documentation. You can see r^2 as a comparison of your model fit (in the context of linear regression, e.g. a model of order 1, i.e. affine) to a model of order 0 (just fitting a constant), both fit by minimizing a squared loss. The constant that minimizes the squared error is the mean. Since you are doing cross-validation with held-out data, it can happen that the mean of your test set is wildly different from the mean of your training set. This alone can make the squared error of your predictions much higher than the error of simply predicting the mean of the test data, which results in a negative r^2 score.
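To see this concretely, here is a minimal sketch (with made-up numbers) using sklearn.metrics.r2_score. R^2 is computed as 1 - SS_res/SS_tot, where SS_tot is taken around the mean of the test targets, so predictions centered on a very different training mean push the score below zero:

import numpy as np
from sklearn.metrics import r2_score

# Hypothetical split: the training targets averaged about 2.0, so a
# constant-like model predicts 2.0 everywhere, but the held-out test
# targets happen to center around 11.0.
y_test = np.array([10.0, 11.0, 12.0])
y_pred = np.full_like(y_test, 2.0)

# R^2 = 1 - SS_res / SS_tot, with SS_tot measured around the test mean,
# so here SS_res (245) >> SS_tot (2) and the score is strongly negative.
print(r2_score(y_test, y_pred))  # -121.5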

In the worst case, if your data do not explain your target at all, these scores can become very strongly negative. Try:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score  # sklearn.cross_validation in older releases

rng = np.random.RandomState(42)
X = rng.randn(100, 80)  # 100 samples, 80 pure-noise features
y = rng.randn(100)      # y has nothing to do with X whatsoever

scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring='r2')

This should result in negative r^2 values.

In [23]: scores
Out[23]: 
array([-240.17927358,   -5.51819556,  -14.06815196,  -67.87003867,
        -64.14367035])

The important question now is whether this is because linear models simply do not find anything in your data, or because of something that may be fixable in the preprocessing of your data. Have you tried scaling your columns to have mean 0 and variance 1? You can do this with sklearn.preprocessing.StandardScaler. In fact, you should create a new estimator by chaining a StandardScaler and the LinearRegression into a pipeline using sklearn.pipeline.Pipeline. Next, you may want to try Ridge regression.
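As a rough sketch of both suggestions (assuming the modern sklearn.model_selection API and reusing fire.data and fire.target from the question; the Ridge alpha below is an arbitrary starting value, not a tuned one):

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Standardize each column (mean 0, variance 1), then fit an
# L2-regularized linear model instead of plain least squares.
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('ridge', Ridge(alpha=1.0)),  # alpha=1.0 is an assumed default; tune it, e.g. with GridSearchCV
])

scores = cross_val_score(pipe, fire.data, fire.target, cv=10, scoring='r2')

Putting the scaler inside the pipeline matters: it is then re-fit on each training fold only, so no statistics leak from the test folds into the preprocessing.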

That concludes this article on Scikit-learn returning coefficient of determination (R^2) values smaller than -1. We hope the recommended answer is helpful.
