Problem description
I am getting different results when running RandomizedPCA on sparse and dense matrices:
import numpy as np
import scipy.sparse as scsp
from sklearn.decomposition import RandomizedPCA
x = np.matrix([[1,2,3,2,0,0,0,0],
[2,3,1,0,0,0,0,3],
[1,0,0,0,2,3,2,0],
[3,0,0,0,4,5,6,0],
[0,0,4,0,0,5,6,7],
[0,6,4,5,6,0,0,0],
[7,0,5,0,7,9,0,0]])
csr_x = scsp.csr_matrix(x)
s_pca = RandomizedPCA(n_components=2)
s_pca_scores = s_pca.fit_transform(csr_x)
s_pca_weights = s_pca.explained_variance_ratio_
d_pca = RandomizedPCA(n_components=2)
d_pca_scores = d_pca.fit_transform(x)
d_pca_weights = d_pca.explained_variance_ratio_
print 'sparse matrix scores {}'.format(s_pca_scores)
print 'dense matrix scores {}'.format(d_pca_scores)
print 'sparse matrix weights {}'.format(s_pca_weights)
print 'dense matrix weights {}'.format(d_pca_weights)
Results:
sparse matrix scores [[ 1.90912166 2.37266113]
[ 1.98826835 0.67329466]
[ 3.71153199 -1.00492408]
[ 7.76361811 -2.60901625]
[ 7.39263662 -5.8950472 ]
[ 5.58268666 7.97259172]
[ 13.19312194 1.30282165]]
dense matrix scores [[-4.23432815 0.43110596]
[-3.87576857 -1.36999888]
[-0.05168291 -1.02612363]
[ 3.66039297 -1.38544473]
[ 1.48948352 -7.0723618 ]
[-4.97601287 5.49128164]
[ 7.98791603 4.93154146]]
sparse matrix weights [ 0.74988508 0.25011492]
dense matrix weights [ 0.55596761 0.44403239]
The dense version gives the same results as normal PCA, but what is going on when the matrix is sparse? Why are the results different?
Recommended answer
For sparse data, RandomizedPCA does not center the data (mean removal), because centering would densify the matrix and could blow up memory usage. That probably explains what you observe.
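To see the effect of centering in isolation, here is a minimal sketch that uses plain NumPy rather than RandomizedPCA itself. The helper `svd_scores` is a hypothetical name introduced only for this illustration. It projects the matrix from the question onto its top two singular vectors, once with the column means removed and once without.

```python
import numpy as np

# Same data as in the question, stored as a plain float array.
x = np.array([[1, 2, 3, 2, 0, 0, 0, 0],
              [2, 3, 1, 0, 0, 0, 0, 3],
              [1, 0, 0, 0, 2, 3, 2, 0],
              [3, 0, 0, 0, 4, 5, 6, 0],
              [0, 0, 4, 0, 0, 5, 6, 7],
              [0, 6, 4, 5, 6, 0, 0, 0],
              [7, 0, 5, 0, 7, 9, 0, 0]], dtype=float)


def svd_scores(data, n_components=2, center=True):
    """Hypothetical helper: scores on the top singular vectors.

    With center=True this is ordinary PCA via an exact SVD;
    with center=False the column means are left in the data.
    """
    if center:
        data = data - data.mean(axis=0)
    u, s, _ = np.linalg.svd(data, full_matrices=False)
    return u[:, :n_components] * s[:n_components]


centered = svd_scores(x, center=True)
uncentered = svd_scores(x, center=False)

print("centered scores:\n", centered)
print("uncentered scores:\n", uncentered)
print("column means (centered):  ", centered.mean(axis=0))
print("column means (uncentered):", uncentered.mean(axis=0))
```

The centered scores have zero column means, as do the dense scores in the question. The uncentered scores do not, which matches the sparse output, whose first column is entirely positive. An exact SVD is used here, so the numbers need not match the randomized solver digit for digit, and the sign of each component is arbitrary.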
I agree this "feature" is poorly documented. Please feel free to report an issue on GitHub to track it and improve the documentation.
Edit: we fixed that discrepancy in scikit-learn 0.15: RandomizedPCA is now deprecated for sparse data. Instead, use TruncatedSVD, which does the same as PCA without trying to center the data.
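As a rough sketch of that suggestion, the snippet below runs TruncatedSVD on the question's matrix in both sparse and dense form. The fixed `random_state=0` is an assumption added here so the two runs use the same random projection. It is not part of the original answer.

```python
import numpy as np
import scipy.sparse as scsp
from sklearn.decomposition import TruncatedSVD

# Same data as in the question, as a dense float array and as CSR.
x = np.array([[1, 2, 3, 2, 0, 0, 0, 0],
              [2, 3, 1, 0, 0, 0, 0, 3],
              [1, 0, 0, 0, 2, 3, 2, 0],
              [3, 0, 0, 0, 4, 5, 6, 0],
              [0, 0, 4, 0, 0, 5, 6, 7],
              [0, 6, 4, 5, 6, 0, 0, 0],
              [7, 0, 5, 0, 7, 9, 0, 0]], dtype=float)
csr_x = scsp.csr_matrix(x)

# TruncatedSVD never centers the data, so sparse and dense inputs
# are treated the same way. random_state is fixed for repeatability.
sparse_svd = TruncatedSVD(n_components=2, random_state=0)
sparse_scores = sparse_svd.fit_transform(csr_x)

dense_svd = TruncatedSVD(n_components=2, random_state=0)
dense_scores = dense_svd.fit_transform(x)

print("sparse matrix scores:\n", sparse_scores)
print("dense matrix scores:\n", dense_scores)
print("sparse matrix weights:", sparse_svd.explained_variance_ratio_)
print("dense matrix weights: ", dense_svd.explained_variance_ratio_)
```

Because neither run removes the mean, the two sets of scores should agree up to floating-point noise. If centered PCA is what you actually need and the data fits in memory, converting to a dense array and using PCA is the straightforward route.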