This article explains why the similarity between two bags of words in gensim.word2vec is computed the way it is; the question and answer below should be a useful reference for anyone running into the same issue.

Problem description

from numpy import array, dot
from gensim import matutils

def n_similarity(self, ws1, ws2):
    v1 = [self[word] for word in ws1]  # vectors for every word in the first set
    v2 = [self[word] for word in ws2]  # vectors for every word in the second set
    # cosine similarity between the two normalized mean vectors
    return dot(matutils.unitvec(array(v1).mean(axis=0)),
               matutils.unitvec(array(v2).mean(axis=0)))

This is the code I excerpted from gensim.word2vec. I know that the similarity between two single words can be computed by cosine distance, but what about two word sets? The code seems to take the mean of each set's word vectors and then compute the cosine distance between the two mean vectors. I know little about word2vec; is there any foundation for such a procedure?
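For context, n_similarity is exposed on a trained model's word vectors. A minimal sketch of how it might be called follows; the corpus, words, and parameters are made up for illustration (in gensim 4.x the dimensionality parameter is named vector_size; older versions call it size):

from gensim.models import Word2Vec

# tiny made-up corpus, just enough to train a toy model
sentences = [['cat', 'dog', 'bird'],
             ['dog', 'puppy', 'kitten'],
             ['cat', 'kitten', 'bird']]
model = Word2Vec(sentences, vector_size=10, min_count=1, seed=1)

# similarity between two bags of words, computed as in the excerpt above
print(model.wv.n_similarity(['cat', 'dog'], ['kitten', 'puppy']))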

Recommended answer

Taking the mean of all the word vectors is the simplest way of reducing them to a single vector so that cosine similarity can be used. The intuition is that by adding up all the word vectors you get a bit of each of them (their meaning) in the result. You then divide by the number of vectors so that larger bags of words don't end up with longer vectors (not that it matters for cosine similarity anyway).
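To make the arithmetic concrete, here is a minimal sketch of the same averaging-plus-cosine computation in plain numpy; the 3-dimensional "word vectors" are made-up toy values, not real embeddings:

import numpy as np

# made-up word vectors for two bags of words
bag1 = np.array([[1.0, 0.0, 1.0],
                 [0.5, 0.5, 0.0]])
bag2 = np.array([[0.9, 0.1, 0.8],
                 [0.4, 0.6, 0.1]])

def cosine(a, b):
    # cosine similarity between two vectors
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# average each bag into one vector, then compare the means
print(cosine(bag1.mean(axis=0), bag2.mean(axis=0)))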

Reducing an entire sentence to a single vector is a complex problem, and there are other ways to do it. I wrote a bit about it in a related question on SO. Since then a number of new algorithms have been proposed. One of the more accessible ones is Paragraph Vector, which you shouldn't have trouble understanding if you are familiar with word2vec.
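For reference, here is a minimal sketch of Paragraph Vector via gensim's Doc2Vec implementation; the two-document corpus and the training parameters are illustrative assumptions, not a recommended setup:

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# tiny made-up corpus; each document carries a unique tag
corpus = [TaggedDocument(words=['the', 'cat', 'sat'], tags=[0]),
          TaggedDocument(words=['the', 'dog', 'ran'], tags=[1])]
model = Doc2Vec(corpus, vector_size=20, min_count=1, epochs=40)

# infer a single vector for an unseen sequence of words
vec = model.infer_vector(['the', 'cat', 'ran'])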

That concludes this look at why gensim.word2vec computes the similarity between two bags of words this way; hopefully the answer above is helpful.
