This article looks at why sklearn's RandomizedPCA produces different results for sparse and dense matrices, and how to handle the discrepancy.

Problem description

I am getting different results when using RandomizedPCA with sparse and dense matrices:

import numpy as np
import scipy.sparse as scsp
from sklearn.decomposition import RandomizedPCA

x = np.matrix([[1,2,3,2,0,0,0,0],
               [2,3,1,0,0,0,0,3],
               [1,0,0,0,2,3,2,0],
               [3,0,0,0,4,5,6,0],
               [0,0,4,0,0,5,6,7],
               [0,6,4,5,6,0,0,0],
               [7,0,5,0,7,9,0,0]])

csr_x = scsp.csr_matrix(x)

s_pca = RandomizedPCA(n_components=2)
s_pca_scores = s_pca.fit_transform(csr_x)
s_pca_weights = s_pca.explained_variance_ratio_

# fit a separate estimator on the dense matrix
d_pca = RandomizedPCA(n_components=2)
d_pca_scores = d_pca.fit_transform(x)
d_pca_weights = d_pca.explained_variance_ratio_

print('sparse matrix scores {}'.format(s_pca_scores))
print('dense matrix scores {}'.format(d_pca_scores))
print('sparse matrix weights {}'.format(s_pca_weights))
print('dense matrix weights {}'.format(d_pca_weights))

Results:

sparse matrix scores [[  1.90912166   2.37266113]
 [  1.98826835   0.67329466]
 [  3.71153199  -1.00492408]
 [  7.76361811  -2.60901625]
 [  7.39263662  -5.8950472 ]
 [  5.58268666   7.97259172]
 [ 13.19312194   1.30282165]]
dense matrix scores [[-4.23432815  0.43110596]
 [-3.87576857 -1.36999888]
 [-0.05168291 -1.02612363]
 [ 3.66039297 -1.38544473]
 [ 1.48948352 -7.0723618 ]
 [-4.97601287  5.49128164]
 [ 7.98791603  4.93154146]]
sparse matrix weights [ 0.74988508  0.25011492]
dense matrix weights [ 0.55596761  0.44403239]

The dense version gives the same results as ordinary PCA, but what is going on when the matrix is sparse? Why are the results different?

Recommended answer

In the case of sparse data, RandomizedPCA does not center the data (mean removal), because centering a sparse matrix would densify it and could blow up memory usage. That probably explains the difference you observe.
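The effect of centering can be reproduced with estimators that still exist in current scikit-learn (RandomizedPCA itself has since been removed). The following is a minimal sketch, assuming scikit-learn ≥ 0.18: PCA with the randomized solver centers the data, TruncatedSVD does not, and centering manually before TruncatedSVD closes the gap.

```python
import numpy as np
from sklearn.decomposition import PCA, TruncatedSVD

x = np.array([[1, 2, 3, 2, 0, 0, 0, 0],
              [2, 3, 1, 0, 0, 0, 0, 3],
              [1, 0, 0, 0, 2, 3, 2, 0],
              [3, 0, 0, 0, 4, 5, 6, 0],
              [0, 0, 4, 0, 0, 5, 6, 7],
              [0, 6, 4, 5, 6, 0, 0, 0],
              [7, 0, 5, 0, 7, 9, 0, 0]], dtype=float)

# PCA always removes the mean before the SVD.
pca_scores = PCA(n_components=2, svd_solver="randomized",
                 random_state=0).fit_transform(x)

# TruncatedSVD never centers -- this is what the sparse code path effectively did.
svd_scores = TruncatedSVD(n_components=2, random_state=0).fit_transform(x)

# Centering manually first makes TruncatedSVD agree with PCA
# (compared via absolute values to ignore per-component sign flips).
x_centered = x - x.mean(axis=0)
svd_centered = TruncatedSVD(n_components=2, random_state=0).fit_transform(x_centered)

print(np.allclose(np.abs(pca_scores), np.abs(svd_centered), atol=1e-6))  # True
print(np.allclose(np.abs(pca_scores), np.abs(svd_scores), atol=1e-6))    # False
```

So the discrepancy in the question is entirely the missing centering step, not a defect in the randomized solver itself.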

I agree this "feature" is poorly documented. Please feel free to open an issue on GitHub to track it and improve the documentation.

Edit: we fixed this discrepancy in scikit-learn 0.15: RandomizedPCA is now deprecated for sparse data. Use TruncatedSVD instead, which does the same as PCA without trying to center the data.
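In a modern scikit-learn, where RandomizedPCA has been removed entirely, the recommended sparse workflow is TruncatedSVD, which accepts CSR input directly and gives consistent results for sparse and dense versions of the same matrix. A minimal sketch, assuming scikit-learn ≥ 0.20:

```python
import numpy as np
import scipy.sparse as scsp
from sklearn.decomposition import TruncatedSVD

x = np.array([[1, 2, 3, 2, 0, 0, 0, 0],
              [2, 3, 1, 0, 0, 0, 0, 3],
              [1, 0, 0, 0, 2, 3, 2, 0],
              [3, 0, 0, 0, 4, 5, 6, 0],
              [0, 0, 4, 0, 0, 5, 6, 7],
              [0, 6, 4, 5, 6, 0, 0, 0],
              [7, 0, 5, 0, 7, 9, 0, 0]], dtype=float)
csr_x = scsp.csr_matrix(x)

# TruncatedSVD works on sparse input without densifying it.
sparse_scores = TruncatedSVD(n_components=2, random_state=0).fit_transform(csr_x)
dense_scores = TruncatedSVD(n_components=2, random_state=0).fit_transform(x)

# Neither path centers the data, so sparse and dense input now agree
# (compared via absolute values to ignore per-component sign flips).
print(np.allclose(np.abs(sparse_scores), np.abs(dense_scores), atol=1e-6))
```

If centered components are actually needed on sparse data, the usual workaround is to accept densification for small matrices, or to use an estimator that centers implicitly without materializing the dense matrix.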
