This article looks at how to deal with the confusing probabilities returned by predict_proba in scikit-learn's support vector machines; it may serve as a useful reference if you run into the same problem.

Problem Description

My purpose is to draw the PR curve from the sorted per-sample probabilities for a specific class. However, I found that the probabilities returned by svm's predict_proba() behave in two different ways when I use two different standard datasets: iris and digits.

The first case is evaluated on the iris dataset with the Python code below, and it works as expected: the predicted class gets the highest probability.
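
The code below calls a small helper, proba_to_class, that the question never shows. A minimal sketch of what it presumably does, mapping each row of probabilities to the class with the highest probability, could look like this (a hypothetical reconstruction, not the asker's actual code):

import numpy as np

# Hypothetical reconstruction of the asker's helper: for each sample,
# return the class whose predicted probability is highest.
def proba_to_class(proba, classes):
    return np.asarray(classes)[np.argmax(proba, axis=1)]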

from sklearn import datasets
from sklearn.svm import SVC
from sklearn.metrics.pairwise import chi2_kernel

D = datasets.load_iris()
clf = SVC(kernel=chi2_kernel, probability=True).fit(D.data, D.target)
output_predict = clf.predict(D.data)
output_proba = clf.predict_proba(D.data)
output_decision_function = clf.decision_function(D.data)
output_my = proba_to_class(output_proba, clf.classes_)

print(D.data.shape, D.target.shape)
print("target:", D.target[:2])
print("class:", clf.classes_)
print("output_predict:", output_predict[:2])
print("output_proba:", output_proba[:2])

Next, it produces the output below. Apparently, the highest probability for each sample matches the output of predict(): 0.97181088 for sample #1 and 0.96961523 for sample #2.

(150, 4) (150,)
target: [0 0]
class: [0 1 2]
output_predict: [0 0]
output_proba: [[ 0.97181088  0.01558693  0.01260218]
[ 0.96961523  0.01702481  0.01335995]]

However, when I change the dataset to digits with the following code, the probabilities show the opposite behavior: the label output by predict() corresponds to the lowest probability of each sample, 0.00190932 for sample #1 and 0.00220549 for sample #2.

D = datasets.load_digits()

Output:

(1797, 64) (1797,)
target: [0 1]
class: [0 1 2 3 4 5 6 7 8 9]
output_predict: [0 1]
output_proba: [[ 0.00190932  0.11212957  0.1092459   0.11262532  0.11150733
   0.11208733  0.11156622  0.11043403  0.10747514  0.11101985]
 [ 0.10991574  0.00220549  0.10944998  0.11288081  0.11178518
   0.11234661  0.11182221  0.11065663  0.10770783  0.11122952]]
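
To quantify the mismatch, one can count how often predict() agrees with the argmax of predict_proba(). This is a sketch reusing the clf and D fitted on the digits data above:

import numpy as np

# Fraction of samples where the predicted label equals the class with the
# highest predict_proba() score; on digits this is expected to be well below 1.0.
proba_argmax = clf.classes_[np.argmax(clf.predict_proba(D.data), axis=1)]
agreement = np.mean(clf.predict(D.data) == proba_argmax)
print("predict vs. argmax(predict_proba) agreement:", agreement)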

I've read this post, which suggests the solution of using a linear SVM with decision_function(). However, because of my task, I still have to focus on the chi-squared kernel for SVM.

Is there any solution?

Recommended Answer

As the documentation states, there is no guarantee that predict_proba and predict will give consistent results on SVC: the probabilities come from Platt scaling, which is fit with an internal cross-validation and can disagree with the decision function. You can simply use decision_function instead. That is true for both linear and kernel SVM.
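
A minimal sketch of that workaround for the asker's PR-curve goal, ranking samples by the decision_function score of one class instead of by predict_proba; precision_recall_curve is scikit-learn's standard utility, and the choice of class 0 is only illustrative:

from sklearn import datasets
from sklearn.svm import SVC
from sklearn.metrics import precision_recall_curve
from sklearn.metrics.pairwise import chi2_kernel

D = datasets.load_digits()
# probability=True is not needed when ranking by decision_function.
clf = SVC(kernel=chi2_kernel, decision_function_shape="ovr").fit(D.data, D.target)

target_class = 0  # illustrative: draw the PR curve for class 0
# With decision_function_shape="ovr" (the default), decision_function returns
# one score column per class, ordered like clf.classes_.
scores = clf.decision_function(D.data)[:, list(clf.classes_).index(target_class)]
precision, recall, thresholds = precision_recall_curve(D.target == target_class, scores)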

