This article looks at how to interpret the output of SciPy's hierarchical clustering dendrogram (and whether a bug was found). It should be a useful reference for anyone running into the same problem.

Problem Description

I am trying to figure out how the output of scipy.cluster.hierarchy.dendrogram works... I thought I knew how it worked and I was able to use the output to reconstruct the dendrogram but it seems as if I am not understanding it anymore or there is a bug in Python 3's version of this module.

This answer, how do I get the subtrees of dendrogram made by scipy.cluster.hierarchy, implies that the dendrogram output dictionary gives dict_keys(['icoord', 'ivl', 'color_list', 'leaves', 'dcoord']) w/ all of the same size so you can zip them and plt.plot them to reconstruct the dendrogram.

Seems simple enough, and I did get it to work back when I used Python 2.7.11, but once I upgraded to Python 3.5.1 my old scripts weren't giving me the same results.

I started reworking my clusters for a very simple repeatable example and think I may have found a bug in Python 3.5.1's version of SciPy version 0.17.1-np110py35_1. Going to use the Scikit-learn datasets b/c most people have that module from the conda distribution.

Why aren't these lining up, and why can't I reconstruct the dendrogram this way?

# Init
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()

# Load data
from sklearn.datasets import load_diabetes

# Clustering
from scipy.cluster.hierarchy import dendrogram, fcluster, leaves_list
from scipy.spatial import distance
from fastcluster import linkage # You can use the SciPy one too

%matplotlib inline

# Dataset
A_data = load_diabetes().data
DF_diabetes = pd.DataFrame(A_data, columns = ["attr_%d" % j for j in range(A_data.shape[1])])

# Absolute value of correlation matrix, then subtract from 1 for dissimilarity
DF_dism = 1 - np.abs(DF_diabetes.corr())

# Compute average linkage
A_dist = distance.squareform(DF_dism.values)  # .as_matrix() was removed in newer pandas
Z = linkage(A_dist,method="average")

# I modded the SO code from the above answer for the plot function
def plot_tree( D_dendro, ax ):
    # Set up plotting data
    leaves = D_dendro["ivl"]
    icoord = np.array( D_dendro['icoord'] )
    dcoord = np.array( D_dendro['dcoord'] )
    color_list = D_dendro["color_list"]

    # Plot colors
    for leaf, xs, ys, color in zip(leaves, icoord, dcoord, color_list):
        print(leaf, xs, ys, color, sep="\t")
        plt.plot(xs, ys,  color)

    # Set min/max of plots
    xmin, xmax = icoord.min(), icoord.max()
    ymin, ymax = dcoord.min(), dcoord.max()

    plt.xlim( xmin-10, xmax + 0.1*abs(xmax) )
    plt.ylim( ymin, ymax + 0.1*abs(ymax) )

    # Set up ticks
    ax.set_xticks( np.arange(5, len(leaves) * 10 + 5, 10))
    ax.set_xticklabels(leaves, fontsize=10, rotation=45)

    plt.show()

fig, ax = plt.subplots()
D1 = dendrogram(Z=Z, labels=DF_dism.index, color_threshold=None, no_plot=True)
plot_tree(D_dendro=D1, ax=ax)
attr_1  [ 15.  15.  25.  25.]   [ 0.          0.10333704  0.10333704  0.        ]   g
attr_4  [ 55.  55.  65.  65.]   [ 0.          0.26150727  0.26150727  0.        ]   r
attr_5  [ 45.  45.  60.  60.]   [ 0.          0.4917828   0.4917828   0.26150727]   r
attr_2  [ 35.   35.   52.5  52.5]   [ 0.          0.59107459  0.59107459  0.4917828 ]   b
attr_8  [ 20.    20.    43.75  43.75]   [ 0.10333704  0.65064998  0.65064998  0.59107459]   b
attr_6  [ 85.  85.  95.  95.]   [ 0.          0.60957062  0.60957062  0.        ]   b
attr_7  [ 75.  75.  90.  90.]   [ 0.          0.68142114  0.68142114  0.60957062]   b
attr_0  [ 31.875  31.875  82.5    82.5  ]   [ 0.65064998  0.72066112  0.72066112  0.68142114]   b
attr_3  [  5.       5.      57.1875  57.1875]   [ 0.          0.80554653  0.80554653  0.72066112]   b

Here's one w/o the labels and just the icoord values for the x-axis

So check out how the colors aren't mapping correctly. It says the icoord [ 15. 15. 25. 25.] goes with attr_1, but based on the values it looks like it goes with attr_4. Also, it doesn't go all the way to the last leaf (attr_9), and that's b/c the lengths of icoord and dcoord are 1 less than the number of ivl labels.

# (these are the lists extracted inside plot_tree)
print([len(x) for x in [leaves, icoord, dcoord, color_list]])
#[10, 9, 9, 9]

Recommended Answer

icoord, dcoord and color_list describe the links, not the leaves. icoord and dcoord give the coordinates of the "arches" (i.e. upside-down U or J shapes) for each link in a plot, and color_list is the color of those arches. In a full plot, the length of icoord, etc., will be one less than the length of ivl, as you have observed.
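As a quick sanity check (a minimal sketch reusing the Z and D1 objects from the question code above), every row of the linkage matrix Z is one merge, and every merge is drawn as one arch, so the link-level lists all have n - 1 entries for n leaves:

# Link-level vs. leaf-level sizes, assuming the diabetes setup from the question
n_leaves = len(D1["ivl"])     # 10 leaves (one per attribute)
n_links = len(Z)              # 9 merges for 10 leaves
print(n_leaves, n_links)      # 10 9
print(len(D1["icoord"]), len(D1["dcoord"]), len(D1["color_list"]))  # 9 9 9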

Don't try to line up the ivl list with the icoord, dcoord and color_list lists. They are associated with different things.
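Here is a minimal sketch of how the arches can be redrawn correctly, reusing Z and D1 from the question (the name plot_tree_fixed is just for illustration): zip only the three link-level lists together, and label the leaves separately from ivl, whose left-to-right order matches the x positions 5, 15, 25, ... in icoord units.

def plot_tree_fixed(D_dendro, ax):
    # Link-level data: one arch (upside-down U) per merge in the linkage matrix
    icoord = np.array(D_dendro["icoord"])
    dcoord = np.array(D_dendro["dcoord"])
    color_list = D_dendro["color_list"]
    # Leaf-level data: one label per observation, in left-to-right plot order
    leaves = D_dendro["ivl"]

    # Zip only the link-level lists together; each triple draws one arch
    for xs, ys, color in zip(icoord, dcoord, color_list):
        ax.plot(xs, ys, color)

    # Leaves sit at x = 5, 15, 25, ... in the same order as ivl
    ax.set_xticks(np.arange(5, len(leaves) * 10 + 5, 10))
    ax.set_xticklabels(leaves, fontsize=10, rotation=45)

    ax.set_xlim(icoord.min() - 10, icoord.max() + 0.1 * abs(icoord.max()))
    ax.set_ylim(dcoord.min(), dcoord.max() + 0.1 * abs(dcoord.max()))

fig, ax = plt.subplots()
plot_tree_fixed(D_dendro=D1, ax=ax)
plt.show()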

That concludes this article on interpreting the output of SciPy's hierarchical clustering dendrogram (and the possible bug). Hopefully the recommended answer is helpful.
