PCA in Sklearn - ValueError: array must not contain infs or NaNs

Problem description

I am trying to use grid search to choose the number of principal components of the data before fitting a linear regression. I am not sure how to build the dictionary of candidate component counts. I put my list into a dictionary in the param_grid parameter, but I think I did it wrong. So far, I get an error saying that my array contains infs or NaNs.

I am following the scikit-learn example that chains PCA with a regression in a pipeline: http://scikit-learn.org/stable/auto_examples/plot_digits_pipe.html

ValueError: array must not contain infs or NaNs

I was able to reproduce the same error on a small example; my real dataset is larger:

import pandas as pd
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

df2 = pd.DataFrame({ 'C' : pd.Series(1, index = list(range(8)),dtype = 'float32'),
                     'D' : np.array([3] * 8,dtype = 'int32'),
                     'E' : pd.Categorical(["test", "train", "test", "train",
                     "test", "train", "test", "train"])})

df3 = pd.get_dummies(df2)

lm = LinearRegression()

pipe = [('pca',PCA(whiten=True)),
         ('clf' ,lm)]

pipe = Pipeline(pipe)


param_grid = {
    'pca__n_components': np.arange(2,4)}

X = df3.to_numpy(dtype=float)  # as_matrix() was removed in pandas 1.0

CLF = GridSearchCV(pipe, param_grid = param_grid, verbose = 1, cv = 3)

y = np.random.normal(0,1,len(X)).reshape(-1,1)

CLF.fit(X,y)

ValueError: array must not contain infs or NaNs

I passed y to the fit call, but it still gave me the same error. That, however, was with my own dataset and not the reproducible example.

Recommended answer

This looks like a known issue in scikit-learn 0.18.1.

See the bug report: https://github.com/scikit-learn/scikit-learn/issues/7568
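
Before changing the pipeline, it can help to rule out the input data as the source of the message. The snippet below is only a minimal sketch: report_non_finite is a hypothetical helper name, and X and y are stand-in arrays with the same kind of values as the question's example, not the asker's real data.

```python
import numpy as np


def report_non_finite(name, arr):
    """Print and return how many entries of arr are NaN or +/-inf."""
    arr = np.asarray(arr, dtype=float)
    n_bad = int((~np.isfinite(arr)).sum())
    print(f"{name}: {n_bad} non-finite value(s) out of {arr.size}")
    return n_bad


# Stand-in data (assumption): two constant columns plus two dummy columns
X = np.array([[1.0, 3.0, 1.0, 0.0],
              [1.0, 3.0, 0.0, 1.0]] * 4)
y = np.random.default_rng(0).normal(0, 1, len(X))

report_non_finite("X", X)
report_non_finite("y", y)
```

If both counts are zero, that suggests the non-finite values are produced somewhere inside the pipeline rather than being present in the input.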

The workaround described in the bug report is to use PCA with svd_solver='full'. So try this code:

pipe = [('pca',PCA(whiten=True,svd_solver='full')),
       ('clf' ,lm)]
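
For reference, here is a self-contained sketch of the whole grid search with that workaround applied. It is not the asker's exact setup: the data is synthetic (an assumption, chosen so that no feature is constant), and the candidate values for n_components are only illustrative. The param_grid keys follow the Pipeline convention of step name, two underscores, then parameter name.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic stand-in data (assumption): 60 samples, 5 non-constant features
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 5))
y = X @ np.array([1.5, -2.0, 0.5, 0.0, 1.0]) + rng.normal(scale=0.1, size=60)

pipe = Pipeline([
    ("pca", PCA(whiten=True, svd_solver="full")),  # force the exact SVD solver
    ("clf", LinearRegression()),
])

# Keys are "<step name>__<parameter name>"; values are the candidates to try
param_grid = {"pca__n_components": [2, 3, 4]}

search = GridSearchCV(pipe, param_grid=param_grid, cv=3)
search.fit(X, y)

print(search.best_params_)
```

The names accepted in param_grid can be listed with pipe.get_params().keys(), which is a quick way to check the spelling of a key such as pca__n_components.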

09-25 07:36