This article explains how to use infer_vector() in gensim.doc2vec, through a question and its recommended answer; it should be a useful reference for anyone hitting the same problem.

Problem description

import gensim
import numpy as np
from numpy import linalg

def cosine(vector1, vector2):
    # cosine similarity of two vectors
    cosV12 = np.dot(vector1, vector2) / (linalg.norm(vector1) * linalg.norm(vector2))
    return cosV12

model = gensim.models.doc2vec.Doc2Vec.load('Model_D2V_Game')
string = '民生 为了 父亲 我 要 坚强 地 ...'
tokens = string.split(' ')
vector1 = model.infer_vector(doc_words=tokens, alpha=0.1, min_alpha=0.0001, steps=5)
vector2 = model.docvecs.doctag_syn0[0]   # the bulk-trained vector for doctag 0
print(cosine(vector2, vector1))

-0.0232586

I used my training data to train a doc2vec model. Then I used infer_vector() to generate a vector for a document that was in the training data. But they are different: the cosine similarity between vector2, which was saved in the doc2vec model, and vector1, which was generated by infer_vector(), was very small (-0.0232586). That doesn't seem reasonable...

I found where my mistake was: I should have used string = u'民生 为了 父亲 我 要 坚强 地 ...' instead of string = '民生 为了 父亲 我 要 坚强 地 ...'. With that correction, the cosine similarity goes up to 0.889342.
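To see why the u'' prefix matters on Python 2, here is a minimal sketch; the model.wv.vocab lookup is an assumption about the gensim version in use, and it reuses the model loaded above:

# -*- coding: utf-8 -*-
# Python 2 sketch: byte-string tokens miss a unicode-keyed vocabulary
s_bytes = '民生 为了 父亲'                       # str: UTF-8 byte strings after split()
s_uni = u'民生 为了 父亲'                        # unicode: matches gensim's vocab keys
print(s_bytes.split(' ')[0] in model.wv.vocab)   # likely False: '\xe6\xb0\x91\xe7\x94\x9f' != u'民生'
print(s_uni.split(' ')[0] in model.wv.vocab)     # likely True, if the word was seen in training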

Recommended answer

As you've noticed, infer_vector() requires its doc_words argument to be a list of tokens – matching the same kind of tokenization that was used in training the model. (Passing it a string causes it to just see each individual character as an item in a tokenized list, and even if a few of the tokens are known vocabulary tokens – as with 'a' and 'I' in English – you're unlikely to get good results.)
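A minimal sketch of that failure mode, reusing the model and the cosine() helper from the question (the variable names here are illustrative):

tokens = u'民生 为了 父亲 我 要 坚强 地'.split(' ')

# Correct: a list of tokens, matching the training tokenization
v_tokens = model.infer_vector(doc_words=tokens)

# Wrong: a raw string is iterated character by character ('民', '生', ' ', ...)
v_chars = model.infer_vector(doc_words=u'民生 为了 父亲 我 要 坚强 地')

print(cosine(v_tokens, v_chars))  # typically low: the model effectively saw two different documents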

Additionally, the default parameters of infer_vector() may be far from optimal for many models. In particular, a larger steps (at least as large as the number of model training iterations, but perhaps even many times larger) is often helpful. Also, a smaller starting alpha (perhaps just 0.025, the common default for bulk training) may give better results.
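For example, a sketch of inference with such adjusted parameters, applied to the question's setup (the exact values are illustrative, not tuned):

# More inference passes and a smaller starting learning rate than infer_vector()'s defaults
vector1 = model.infer_vector(doc_words=tokens, alpha=0.025, min_alpha=0.0001, steps=100)
print(cosine(model.docvecs.doctag_syn0[0], vector1))  # should land much closer to the bulk-trained vector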

Your test of whether inference gets a vector close to the same vector from bulk training is a reasonable sanity check, of both your inference parameters and the earlier training: is the model as a whole learning generalizable patterns in the data? But because most modes of Doc2Vec inherently use randomness, or (during bulk training) can be affected by randomness introduced by multiple-thread scheduling jitter, you shouldn't expect identical results. They'll just get generally closer the more training iterations/steps you do.
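One way to see that randomness directly, again reusing tokens and cosine() from above, is to infer the same document twice and compare:

# Two inferences of the same token list give similar, but not identical, vectors
v_a = model.infer_vector(doc_words=tokens, alpha=0.025, steps=100)
v_b = model.infer_vector(doc_words=tokens, alpha=0.025, steps=100)
print(cosine(v_a, v_b))  # high, but usually not exactly 1.0; more steps narrows the gap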

Finally, note that the most_similar() method on Doc2Vec's docvecs component can also take a raw vector, to give back a list of most-similar already-known vectors. So you can try the following...

ivec = model.infer_vector(doc_words=tokens_list, steps=20, alpha=0.025)
print(model.docvecs.most_similar(positive=[ivec], topn=10))

...and get a ranked list of the top-10 most-similar (doctag, similarity_score) pairs.
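As a follow-up sanity check, here is a sketch that assumes documents were tagged with plain integer indices (as doctag_syn0[0] suggests): the document's own doctag should rank at or near the top.

sims = model.docvecs.most_similar(positive=[ivec], topn=10)   # [(doctag, similarity), ...]
top_doctag, top_sim = sims[0]
# assuming integer doctags, doctag 0 is the document whose vector was read above
if top_doctag == 0:
    print('inference recovered doc 0 as its own nearest neighbour (similarity %.3f)' % top_sim)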

