This article looks at how to handle "ValueError: Array contains NaN or infinity" raised in _assert_all_finite when running LinearSVC, and should be a useful reference for anyone hitting the same problem.

Problem Description

I was trying to classify the wine data set at http://archive.ics.uci.edu/ml/datasets/Wine+Quality using logistic regression (with method='bfgs' and the l1 norm) and caught a singular matrix error (raise LinAlgError('Singular matrix')), in spite of full rank [which I tested using np.linalg.matrix_rank(data[train_cols].values)].

This is how I came to the conclusion that some features might be linear combinations of others. Towards this, I experimented with Grid search/LinearSVC - and I get the error below, along with my code & data set.

I can see that only 6/7 features are actually "independent" - which I interpret by comparing the rows of x_train_new[0] and x_train (so I can tell which columns are redundant).
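For reference, the rank check mentioned above can be done in a couple of lines - this is just a sketch (not from the original post), assuming the data has been loaded into the DataFrame df used in the code below, with 'quality' as the target column:

    import numpy as np

    feature_cols = [c for c in df.columns if c != 'quality']
    X_mat = df[feature_cols].values

    # If the rank is lower than the number of columns, some features are
    # (near-)linear combinations of the others.
    print("%d independent columns out of %d" % (np.linalg.matrix_rank(X_mat), X_mat.shape[1]))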

    # Train & test DATA CREATION
    from sklearn.svm import LinearSVC
    import numpy, random
    import pandas as pd
    # Note: reading directly from GitHub usually requires the raw file URL, not the HTML page linked here.
    df = pd.read_csv("https://github.com/ekta1007/Predicting_wine_quality/blob/master/wine_red_dataset.csv") #,skiprows=0, sep=',')

    df = df.dropna(axis=1, how='any') # also tried how='all' - still get NaN errors as below
    header = list(df.columns.values) # or df.columns
    X = df[df.columns - [header[-1]]] # header[-1] = ['quality'] - this is to make the code generic enough
    Y = df[header[-1]] # df['quality']
    rows = random.sample(df.index, int(len(df)*0.7)) # indexing the rows that will be picked in the train set
    x_train, y_train = X.ix[rows], Y.ix[rows] # Fetching the data frame using indexes
    x_test, y_test = X.drop(rows), Y.drop(rows)

    # Training the classifier using C-Support Vector Classification.
    clf = LinearSVC(C=0.01, penalty="l1", dual=False) #,tol=0.0001,fit_intercept=True, intercept_scaling=1)
    clf.fit(x_train, y_train)
    x_train_new = clf.fit_transform(x_train, y_train)
    #print x_train_new # works
    clf.predict(x_test) # does NOT work and gives NaN errors for some x_tests

    clf.score(x_test, y_test) # Does NOT work
    clf.coef_ # Works, but I am not sure if this is OK, given the huge number of NaN's - or do the coefs get impacted?
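(As a side note, not from the original post: one way to see which columns the l1 penalty actually kept - which is what the x_train_new[0] vs x_train comparison is doing by hand - is to look for non-zero coefficients.)

    import numpy as np

    # A feature is dropped by the l1 penalty when its coefficient is zero
    # for every class; the remaining columns are the "independent" ones.
    kept = np.any(clf.coef_ != 0, axis=0)
    print([col for col, keep in zip(x_train.columns, kept) if keep])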

clf.predict(x_train)
552   NaN
209   NaN
427   NaN
288   NaN
175   NaN
427   NaN
748     7
552   NaN
429   NaN
[... and MORE]
Name: quality, Length: 1119

clf.predict(x_test)
76    NaN
287   NaN
420     7
812   NaN
443     7
420     7
430   NaN
373     5
624     5
[..and More]
Name: quality, Length: 480

Strangely, when I run clf.predict(x_train) I still see some NaN's - am I doing something wrong? That shouldn't happen after the model has been trained on that very data, right?

According to this thread - How to fix "NaN or infinity" issue for sparse matrix in python? - I also checked that there are no nulls in my csv file (though I relabeled "quality" to just the 5 and 7 labels, down from the original range(3,10)).

Also - here are x_test & y_test/train ...

x_test
<class 'pandas.core.frame.DataFrame'>
Int64Index: 480 entries, 1 to 1596
Data columns:
alcohol                 480  non-null values
chlorides               480  non-null values
citric acid             480  non-null values
density                 480  non-null values
fixed acidity           480  non-null values
free sulfur dioxide     480  non-null values
pH                      480  non-null values
residual sugar          480  non-null values
sulphates               480  non-null values
total sulfur dioxide    480  non-null values
volatile acidity        480  non-null values
dtypes: float64(11)

y_test
1     5
10    5
18    5
21    5
30    5
31    7
36    7
40    5
50    5
52    7
53    5
55    5
57    5
60    5
61    5
[..And MORE]
Name: quality, Length: 480

Finally.

clf.score(x_test, y_test)

Traceback (most recent call last):
  File "<pyshell#31>", line 1, in <module>
    clf.score(x_test, y_test)
  File "C:\Python27\lib\site-packages\sklearn\base.py", line 279, in score
    return accuracy_score(y, self.predict(X))
  File "C:\Python27\lib\site-packages\sklearn\metrics\metrics.py", line 742, in accuracy_score
    y_true, y_pred = check_arrays(y_true, y_pred)
  File "C:\Python27\Lib\site-packages\sklearn\utils\validation.py", line 215, in check_arrays
  File "C:\Python27\Lib\site-packages\sklearn\utils\validation.py", line 18, in _assert_all_finite
ValueError: Array contains NaN or infinity.


# I also explicitly checked for NaN's like this:
for i in df.columns:
    df[i].isnull()
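The loop above only builds the boolean Series without reporting anything; a slightly more direct check (a small sketch using standard pandas/numpy calls, not part of the original post) would be:

    import numpy as np

    print(df.isnull().sum())            # missing values per column
    print(np.isfinite(df.values).all()) # True only if every value is finite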

Tip: Please also mention whether my thought process of using LinearSVC is correct, given my use case, or should I use Grid-search?

Disclaimer: Parts of this code were built on suggestions from similar contexts on StackOverflow and miscellaneous sources - my real use case is simply trying to assess whether this method is a good fit for my scenario. That's all.

Recommended Answer

This worked. The only thing I really had to change was to use x_test.values (and likewise for the rest of the pandas DataFrames: x_train, y_train, y_test). As pointed out, the only cause was the incompatibility between pandas DataFrames and scikit-learn (which works with numpy arrays).

# Changing your pandas DataFrame to work with scikit-learn by converting it to a numpy array
>>> type(x_test)
<class 'pandas.core.frame.DataFrame'>
>>> type(x_test.values)
<type 'numpy.ndarray'>
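Putting it together, a minimal sketch of the corrected calls (same variable names as in the question, with .values passed wherever scikit-learn expects a numpy array):

    from sklearn.svm import LinearSVC

    clf = LinearSVC(C=0.01, penalty="l1", dual=False)
    clf.fit(x_train.values, y_train.values)        # train on numpy arrays

    predictions = clf.predict(x_test.values)       # no more NaN predictions
    print(clf.score(x_test.values, y_test.values)) # scoring now works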

This hack comes from this post - http://python.dzone.com/articles/python-making-scikit-learn-and - and from @AndreasMueller, who pointed out the inconsistency.

That concludes this post on "ValueError: Array contains NaN or infinity in _assert_all_finite" during LinearSVC - hopefully the recommended answer above helps.
