Should one use distance (dissimilarity) or similarity in R for clustering?


Problem description

I'm working on a clustering problem, and the proxy package in R provides both dist and simil functions.

For my purpose I need a distance matrix, so I initially used dist. Here is the code:

distanceMatrix <- dist(dfm[,-1], method='Pearson')
clusters <- hclust(distanceMatrix)  
clusters$labels <- dfm[,1]#colnames(dfm)[-1]
plot(clusters, labels=clusters$labels)

But after I plotted the result, I found that the clustering is not what I expected, and I know what it should look like.

So I tried simil instead, with code like this:

distanceMatrix <- simil(dfm[,-1], method='Pearson')
clusters <- hclust(pr_simil2dist(distanceMatrix))   
clusters$labels <- dfm[,1]#colnames(dfm)[-1]
plot(clusters, labels=clusters$labels)

This code computes a similarity matrix using simil and converts it to a distance matrix using pr_simil2dist. When I plot that, I get the result I expected!

I'm confused about the relationship between dist and simil. According to the relationship described in the documentation, shouldn't the two code snippets give the same result?

Where did I go wrong?

Edit:

You can try my code with dfm set to the following value; sorry for the bad indentation.

                             Blog china kids music yahoo want wrong
                         Gawker     0    1     0     0    7     0
                  Read/WriteWeb     2    0     1     3    1     1
                 WWdN: In Exile     0    2     4     0    0     0
           ProBlogger Blog Tips     0    0     0     0    2     0
                    Seth's Blog     0    0     1     0    3     1
 The Huffington Post | Raw Feed     0    6     0     0   14     5

Edit:

Actually the sample data is taken from a very big data frame using tail, and I get completely different matrices using dist and simil+pr_simil2dist. The full data can be found here.

In case I made other silly mistakes, here's the full code of my function:

The code I use to read in the data:

dfm<- read.table(filename, header=T, sep='\t', quote='')

The clustering code:

hcluster <- function(dfm, distance='Pearson'){
    dfm <- tail(dfm)[,c(1:7)] # I use this to give the sample data.
    distanceMatrix <- simil(dfm[,-1], method=distance) # pass the 'distance' argument instead of an undefined, unquoted name
    clusters <- hclust(pr_simil2dist(distanceMatrix))   
    clusters$labels <- dfm[,1]#colnames(dfm)[-1]
    plot(clusters, labels=clusters$labels)
}

Matrix using dist:

           94         95         96         97         98
95 -0.2531580                                            
96 -0.2556859 -0.4629100                                 
97  0.9897783 -0.1581139 -0.2927700                      
98  0.8742800 -0.2760788 -0.1022397  0.9079594           
99  0.9114339 -0.5020405 -0.2810414  0.8713293  0.8096980

Matrix using simil+pr_simil2dist:

           94         95         96         97         98
95 1.25315802                                            
96 1.25568595 1.46291005                                 
97 0.01022173 1.15811388 1.29277002                      
98 0.12572004 1.27607882 1.10223973 0.09204062           
99 0.08856608 1.50204055 1.28104139 0.12867065 0.19030202

You can see that corresponding elements in the two matrices add up to 1, which I think is not right. So there must be something I'm doing wrong.

Edit:

After I specify names in the read.table call when reading in the data frame, the dist way and the simil+pr_simil2dist way give the same correct result. So technically the problem is solved, but I don't know why my original way of handling the data frame has anything to do with dist and simil.

Does anyone have a clue?

Recommended answer

I'm not sure what you mean by "not as expected". Whether I compute the dissimilarity matrix directly via proxy::dist(), or compute a similarity matrix via simil() and convert it to a dissimilarity, I get the same matrix:

> dist(dfm, method='Pearson')
                                  Gawker Read/WriteWeb WWdN: In Exile ProBlogger Blog Tips Seth's Blog
Read/WriteWeb                  0.2662006                                                              
WWdN: In Exile                 0.2822594     0.2662006                                                
ProBlogger Blog Tips           0.2928932     0.5917517      0.6984887                                 
Seth's Blog                    0.2662006     0.2928932      0.4072510            0.2928932            
The Huffington Post | Raw Feed 0.1835034     0.2312939      0.2662006            0.2928932   0.2312939

> pr_simil2dist(simil(dfm, method = "pearson"))
                                  Gawker Read/WriteWeb WWdN: In Exile ProBlogger Blog Tips Seth's Blog
Read/WriteWeb                  0.2662006                                                              
WWdN: In Exile                 0.2822594     0.2662006                                                
ProBlogger Blog Tips           0.2928932     0.5917517      0.6984887                                 
Seth's Blog                    0.2662006     0.2928932      0.4072510            0.2928932            
The Huffington Post | Raw Feed 0.1835034     0.2312939      0.2662006            0.2928932   0.2312939

and

d1 <- dist(dfm, method='Pearson')
d2 <- pr_simil2dist(simil(dfm, method = "pearson"))
h1 <- hclust(d1)
h2 <- hclust(d2)
layout(matrix(1:2, ncol = 2))
plot(h1)
plot(h2)
layout(1)
all.equal(h1, h2)

The last line yields:

> all.equal(h1, h2)
[1] "Component 6: target, current do not match when deparsed"

which tells us that h1 and h2 are exactly the same except for the matched function call (obviously, as we used d1 and d2 in the respective calls).
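To check that the stored call really is the only difference, one option is to drop the call component from both objects before comparing. The following is a minimal base R sketch, not the code used in this answer. It assumes a stand-in for the proxy functions: stats::cor supplies a Pearson correlation as the similarity, and the helper simil_to_dist (a hypothetical name) converts it with 1 - s. The conversion that proxy actually applies may differ.

```r
# Small count matrix using the sample rows from the question.
x <- matrix(c(0, 1, 0, 0,  7, 0,
              2, 0, 1, 3,  1, 1,
              0, 2, 4, 0,  0, 0,
              0, 0, 0, 0,  2, 0,
              0, 0, 1, 0,  3, 1,
              0, 6, 0, 0, 14, 5),
            nrow = 6, byrow = TRUE,
            dimnames = list(paste0("blog", 1:6), NULL))

# Hypothetical helper: turn a similarity into a dissimilarity (assumed 1 - s).
simil_to_dist <- function(s) 1 - s

# Route 1: build the dissimilarity in one step.
d1 <- as.dist(1 - cor(t(x)))

# Route 2: build the similarity first, then convert it.
s  <- cor(t(x))
d2 <- as.dist(simil_to_dist(s))

h1 <- hclust(d1)
h2 <- hclust(d2)

# The objects differ only in the recorded call, so remove it and compare again.
h1$call <- NULL
h2$call <- NULL
same <- isTRUE(all.equal(h1, h2))
print(same)
```

If same is TRUE, the merge order, heights and labels of the two trees agree, so any visible difference between two dendrograms has to come from the input data rather than from the choice of route.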

The resulting figure (not shown) has the two dendrograms from plot(h1) and plot(h2) side by side.

If you set your object up correctly, then you won't need to fiddle with the labels. Look at the row.names argument to read.table() to see how to specify that a column be used as the row labels when the data are read in.
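As a rough illustration of the row.names suggestion, the sketch below reads a few tab-separated rows from an inline string rather than from a file. The string txt is a hypothetical stand-in for the real input file, which is not available here. The other arguments mirror the read.table call in the question.

```r
# Hypothetical tab-separated input mirroring a few rows of the sample data.
txt <- paste(
  "Blog\tchina\tkids\tmusic\tyahoo\twant\twrong",
  "Gawker\t0\t1\t0\t0\t7\t0",
  "Seth's Blog\t0\t0\t1\t0\t3\t1",
  "The Huffington Post | Raw Feed\t0\t6\t0\t0\t14\t5",
  sep = "\n"
)

# row.names = 1 turns the first column (Blog) into the row labels, so the
# data frame holds only numeric columns; quote = "" keeps the apostrophe
# in "Seth's Blog" from being read as a quote character.
dfm <- read.table(text = txt, header = TRUE, sep = "\t",
                  quote = "", row.names = 1)

print(rownames(dfm))
str(dfm)
```

With the blog names stored as row names, the data frame can go to the distance function as it is, without dropping the first column via dfm[,-1], and the labels travel with the result.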

All of the above uses:

dfm <- structure(list(china = c(0L, 2L, 0L, 0L, 0L, 0L), kids = c(1L, 
0L, 2L, 0L, 0L, 6L), music = c(0L, 1L, 4L, 0L, 1L, 0L), yahoo = c(0L, 
3L, 0L, 0L, 0L, 0L), want = c(7L, 1L, 0L, 2L, 3L, 14L), wrong = c(0L, 
1L, 0L, 0L, 1L, 5L)), .Names = c("china", "kids", "music", "yahoo", 
"want", "wrong"), class = "data.frame", row.names = c("Gawker", 
"Read/WriteWeb", "WWdN: In Exile", "ProBlogger Blog Tips", "Seth's Blog", 
"The Huffington Post | Raw Feed"))
