sklearn.mixture.DPGMM: unexpected results

Question

The results I get from DPGMM are not what I expect. E.g.:

>>> import sklearn.mixture
>>> sklearn.__version__
'0.12-git'
>>> data = [[1.1],[0.9],[1.0],[1.2],[1.0], [6.0],[6.1],[6.1]]
>>> m = sklearn.mixture.DPGMM(n_components=5, n_iter=1000, alpha=1)
>>> m.fit(data)
DPGMM(alpha=1, covariance_type='diag', init_params='wmc', min_covar=None,
   n_components=5, n_iter=1000, params='wmc',
   random_state=<mtrand.RandomState object at 0x108a3f168>, thresh=0.01,
   verbose=False)
>>> m.converged_
True
>>> m.weights_
array([ 0.2,  0.2,  0.2,  0.2,  0.2])
>>> m.means_
array([[ 0.62019109],
       [ 1.16867356],
       [ 0.55713292],
       [ 0.36860511],
       [ 0.17886128]])

I expected the result to be more similar to the vanilla GMM; that is, two gaussians (around values 1 and 6), with non-uniform weights (like [ 0.625, 0.375]). I expected the "unused" gaussians to have weights near zero.
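
For comparison, here is a minimal sketch of such a vanilla GMM fit, using the old-style sklearn.mixture.GMM API from this version of sklearn (the exact fitted numbers will vary between runs):

import sklearn.mixture

data = [[1.1], [0.9], [1.0], [1.2], [1.0], [6.0], [6.1], [6.1]]
# plain GMM with a fixed number of components
g = sklearn.mixture.GMM(n_components=2, covariance_type='diag', n_iter=1000)
g.fit(data)
print g.weights_  # expected: non-uniform, roughly [0.625, 0.375]
print g.means_    # expected: one component near 1, one near 6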

Am I using the model incorrectly?

I've also tried changing alpha without any luck.

Answer

The behaviour is not much different with version 0.14.1 of sklearn. I will use the following code for printing the DPGMM model:

import numpy as np

def pprint(model, data):
    # components that are actually used when predicting on the data
    idx = np.unique(model.predict(data))
    # means, weights and covariances of the fitted model
    m_w_cov = [model.means_, model.weights_, model._get_covars()]
    # flatten each array and keep only the used components
    flattened = map(lambda x: np.array(x).flatten(), m_w_cov)
    filtered = map(lambda x: x[idx], flattened)
    print np.array(filtered)  # rows: means, weights, covariances

This function filters out the redundant (empty) components, i.e. those that are not used by predict, and prints the means, weights and covariances of the remaining ones.

If one makes several tries with the data from the question, one can find two different results:

>>> m = sklearn.mixture.DPGMM(n_components=5, n_iter=1000, alpha=1).fit(data)
>>> m.predict(data)
array([0, 0, 0, 0, 0, 1, 1, 1])
>>> pprint(m, data)
[[  0.62019109   1.16867356]
 [  0.10658447   0.19810279]
 [  1.08287064  12.43049771]]

>>> m = sklearn.mixture.DPGMM(n_components=5, n_iter=1000, alpha=1).fit(data)
>>> m.predict(data)
array([1, 1, 1, 0, 1, 0, 0, 0])
>>> pprint(m, data)
[[  1.24122696   0.64252404]
 [  0.17157736   0.17416976]
 [ 11.51813929   1.07829109]]

From this one can guess that the cause of the unexpected results lies in the fact that some of the intermediate points (1.2 in our case) migrate between classes, so the method is unable to infer the correct model parameters. One reason is that the clustering parameter alpha is too big for our clusters, which contain only 3 elements each; we can do better by reducing this parameter. alpha=0.1 gives more stable results:

>>> m = sklearn.mixture.DPGMM(n_components=5, n_iter=1000, alpha=.1).fit(data)
>>> m.predict(data)
array([1, 1, 1, 1, 1, 0, 0, 0])

But the root cause lies in the stochastic nature of the DPGMM method: it is unable to infer the model structure in the case of such small clusters. Things become better, and the method behaves more as expected, if we replicate the observations 4 times:

>>> m = sklearn.mixture.DPGMM(n_components=5, n_iter=1000, alpha=1).fit(data*4)
>>> pprint(m, data)
[[ 0.90400296  5.46990901]
 [ 0.11166431  0.24956023]
 [ 1.02250372  1.31278926]]

In conclusion, be careful with the fitting parameters of a method, and be aware of the fact that some ML methods do not work well on small or skewed datasets.
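
Note that DPGMM was later deprecated and removed from scikit-learn; in current releases the closest equivalent is sklearn.mixture.BayesianGaussianMixture with a Dirichlet process prior. A rough sketch of the same experiment with the modern API (parameter values here are illustrative, not tuned):

import numpy as np
from sklearn.mixture import BayesianGaussianMixture

data = np.array([[1.1], [0.9], [1.0], [1.2], [1.0], [6.0], [6.1], [6.1]])
# weight_concentration_prior plays a role similar to alpha in the old DPGMM
m = BayesianGaussianMixture(n_components=5,
                            weight_concentration_prior_type='dirichlet_process',
                            weight_concentration_prior=0.1,
                            max_iter=1000)
m.fit(np.tile(data, (4, 1)))    # replicate the observations, as above
print(m.predict(data))          # component assignments for the original points
print(np.round(m.weights_, 3))  # unused components should get weights near zero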
