Why does Sklearn PCA need more samples than new features (n_components)?

Problem description

When using the Sklearn PCA algorithm like this:

import numpy as np
from sklearn.decomposition import PCA

x_orig = np.random.choice([0, 1], (4, 25), replace=True)  # 4 samples, 25 features
pca = PCA(n_components=15)
pca.fit_transform(x_orig).shape

I get the output:

(4, 4)

I expected (wanted) it to be:

(4,15)

I get why it's happening. The sklearn documentation (here) says (assuming their '==' is an assignment operator):

n_components == min(n_samples, n_features)

But why are they doing this? Also, how can I convert an input with shape [1,25] to [1,10] directly (without stacking dummy arrays)?

Recommended answer

Each principal component is the projection of the data onto an eigenvector of the data covariance matrix. If you have fewer samples n than features, the covariance matrix has at most n non-zero eigenvalues (at most n - 1 once the data are centered, as PCA does). Thus, there are only about n eigenvectors/components that make sense.
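The eigenvalue count can be checked directly with NumPy. This is only a minimal sketch, not how scikit-learn computes PCA internally: it builds the covariance matrix by hand for data shaped like the question's (4 samples, 25 features) and counts the eigenvalues above a small tolerance. The seed, the tolerance of 1e-10, and the variable names are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only
n_samples, n_features = 4, 25
X = rng.integers(0, 2, size=(n_samples, n_features)).astype(float)

# PCA centers each feature before decomposing, so do the same here.
X_centered = X - X.mean(axis=0)

# Sample covariance matrix of the features: shape (25, 25).
cov = X_centered.T @ X_centered / (n_samples - 1)

# eigvalsh is the appropriate routine because cov is symmetric.
eigenvalues = np.linalg.eigvalsh(cov)

# Count eigenvalues that are non-zero up to floating-point noise.
n_nonzero = int(np.sum(eigenvalues > 1e-10))

print(cov.shape)
print(n_nonzero)
```

Although the covariance matrix is 25 x 25, the number of non-zero eigenvalues cannot exceed n_samples - 1 = 3 here, because the centered data matrix has rank at most n_samples - 1.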

In principle, it would be possible to have more components than samples, but the superfluous components would be useless noise.

Scikit-learn raises an error instead of silently doing something else. This prevents users from shooting themselves in the foot. Having fewer samples than features can indicate a problem with the data, or a misconception about the methods involved.
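The behavior can be probed with a short script. This is a hedged sketch: the exact behavior depends on the installed scikit-learn version (the question reports a silently capped (4, 4) result, while recent releases are expected to raise a ValueError), so the code handles both outcomes rather than assuming one. The names `outcome`, `n_ok`, and `reduced` are illustrative only.

```python
import numpy as np
from sklearn.decomposition import PCA

x_orig = np.random.choice([0, 1], (4, 25), replace=True)  # 4 samples, 25 features

# Ask for more components than min(n_samples, n_features) = 4.
try:
    shape = PCA(n_components=15).fit_transform(x_orig).shape
    outcome = f"no error, output shape {shape}"
except ValueError as exc:
    outcome = f"ValueError: {exc}"
print(outcome)

# Requesting at most min(n_samples, n_features) components is accepted.
n_ok = min(x_orig.shape)
reduced = PCA(n_components=n_ok).fit_transform(x_orig)
print(reduced.shape)  # (4, 4)
```

Capping n_components at min(n_samples, n_features) before fitting avoids the error, although with so few samples the trailing components carry little or no variance.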
