
Problem Description

I am trying to load a pre-trained Doc2vec model using gensim and use it to map a paragraph to a vector. I am referring to https://github.com/jhlau/doc2vec, and the pre-trained model I downloaded is the English Wikipedia DBOW, available at the same link. However, when I load the Wikipedia Doc2vec model and infer vectors using the following code:

import gensim.models as g
import codecs

model="wiki_sg/word2vec.bin"
test_docs="test_docs.txt"
output_file="test_vectors.txt"

#inference hyper-parameters
start_alpha=0.01
infer_epoch=1000

#load model
test_docs = [x.strip().split() for x in codecs.open(test_docs, "r", "utf-8").readlines()]
m = g.Doc2Vec.load(model)

#infer test vectors
output = open(output_file, "w")
for d in test_docs:
    output.write(" ".join([str(x) for x in m.infer_vector(d, alpha=start_alpha, steps=infer_epoch)]) + "\n")
output.flush()
output.close()

I get an error:

/Users/zhangji/Desktop/CSE547/Project/NLP/venv/lib/python2.7/site-packages/smart_open/smart_open_lib.py:402: UserWarning: This function is deprecated, use smart_open.open instead. See the migration notes for details: https://github.com/RaRe-Technologies/smart_open/blob/master/README.rst#migrating-to-the-new-open-function
  'See the migration notes for details: %s' % _MIGRATION_NOTES_URL
Traceback (most recent call last):
  File "/Users/zhangji/Desktop/CSE547/Project/NLP/AbstractMapping.py", line 19, in <module>
    output.write(" ".join([str(x) for x in m.infer_vector(d, alpha=start_alpha, steps=infer_epoch)]) + "\n")
AttributeError: 'Word2Vec' object has no attribute 'infer_vector'

I know there are a couple of threads regarding the infer_vector issue on Stack Overflow, but none of them resolved my problem. I downloaded the gensim package using

pip install git+https://github.com/jhlau/gensim

In addition, after looking at the source code of the gensim package, I found that when I use Doc2vec.load(), the Doc2vec class doesn't really have a load() function of its own; since it is a subclass of Word2vec, it calls the super method load() in Word2vec, which makes the model m a Word2vec object. However, the infer_vector() function is unique to Doc2vec and does not exist in Word2vec, which is why the error occurs (a quick check confirming this appears after the traceback below). I also tried casting the model m to a Doc2vec, but I got this error:

>>> g.Doc2Vec(m)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/zhangji/Library/Python/2.7/lib/python/site-packages/gensim/models/doc2vec.py", line 599, in __init__
    self.build_vocab(documents, trim_rule=trim_rule)
  File "/Users/zhangji/Library/Python/2.7/lib/python/site-packages/gensim/models/word2vec.py", line 513, in build_vocab
    self.scan_vocab(sentences, trim_rule=trim_rule)  # initial survey
  File "/Users/zhangji/Library/Python/2.7/lib/python/site-packages/gensim/models/doc2vec.py", line 635, in scan_vocab
    for document_no, document in enumerate(documents):
  File "/Users/zhangji/Library/Python/2.7/lib/python/site-packages/gensim/models/word2vec.py", line 1367, in __getitem__
    return vstack([self.syn0[self.vocab[word].index] for word in words])
TypeError: 'int' object is not iterable
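
A quick way to confirm the diagnosis above, in a hypothetical interactive session (m is the object returned by g.Doc2Vec.load(model); the output shown is what the AttributeError in the first traceback implies):

>>> type(m)                      # the fork's load() handed back a Word2Vec
<class 'gensim.models.word2vec.Word2Vec'>
>>> hasattr(m, "infer_vector")   # infer_vector only exists on Doc2Vec
False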

In fact, all I want from gensim for now is to convert a paragraph to a vector using a pre-trained model that works well on academic articles. For various reasons I don't want to train a model on my own. I would be really grateful if someone could help me resolve the issue.

Btw, I am using Python 2.7, and my gensim version is 0.12.4.

Thanks!

Solution

I would avoid using either the 4-year-old nonstandard gensim fork at https://github.com/jhlau/doc2vec, or any 4-year-old saved models that only load with such code.

The Wikipedia DBOW model there is also suspiciously small at 1.4GB. Wikipedia had well over 4 million articles even 4 years ago, and a 300-dimensional Doc2Vec model trained to have doc-vectors for the 4 million articles would be at least 4000000 articles * 300 dimensions * 4 bytes/dimension = 4.8GB in size, not even counting other parts of the model. (So, that download is clearly not the 4.3M doc, 300-dimensional model mentioned in the associated paper – but something that's been truncated in other unclear ways.)
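
(For reference, the back-of-envelope estimate above in code, counting only the raw doc-vector array; the numbers are the ones from the paragraph:)

n_docs = 4000000        # Wikipedia articles
dims = 300              # vector dimensionality
bytes_per_dim = 4       # one float32 value
print(n_docs * dims * bytes_per_dim / 1e9)   # -> 4.8 (GB)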

The current gensim version is 3.8.3, released a few weeks ago.
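
Under current gensim, the equivalent of the question's inference loop is straightforward. A minimal sketch, assuming a model saved by the same modern gensim version at a hypothetical path ("my_doc2vec.model" is illustrative; note that recent releases take epochs rather than the older steps parameter in infer_vector()):

import io
from gensim.models.doc2vec import Doc2Vec

m = Doc2Vec.load("my_doc2vec.model")   # hypothetical path, saved by gensim 3.8.x

# one whitespace-tokenized document per line
with io.open("test_docs.txt", "r", encoding="utf-8") as f:
    test_docs = [line.strip().split() for line in f]

with io.open("test_vectors.txt", "w", encoding="utf-8") as out:
    for words in test_docs:
        vec = m.infer_vector(words, alpha=0.01, epochs=1000)
        out.write(u" ".join(str(x) for x in vec) + u"\n")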

It'd likely take a bit of tinkering, and an overnight or longer runtime, to build your own Doc2Vec model using current code and a current Wikipedia dump, but then you'd be on modern, supported code, with a modern model that better understands words that have come into use over the last 4 years. (And, if you trained a model on a corpus of exactly the kind of documents that interest you, such as academic articles, the vocabulary, the word-senses, and the match with your own text-preprocessing of later inferred documents would all be better.)

There's a Jupyter notebook example of building a Doc2Vec model from Wikipedia, either functional or very close to functional, inside the gensim source tree at:

https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/doc2vec-wikipedia.ipynb
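
For orientation, here is a condensed sketch of what that notebook does, written against the gensim 3.8.x API; the dump filename, output path, and every hyper-parameter value below are illustrative, not the notebook's exact settings:

import multiprocessing
from gensim.corpora.wikicorpus import WikiCorpus
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

class TaggedWikiCorpus(object):
    """Stream Wikipedia articles as TaggedDocuments, tagged by article title."""
    def __init__(self, dump_path):
        self.wiki = WikiCorpus(dump_path)
        self.wiki.metadata = True   # get_texts() then also yields (pageid, title)
    def __iter__(self):
        for tokens, (pageid, title) in self.wiki.get_texts():
            yield TaggedDocument(tokens, [title])

docs = TaggedWikiCorpus("enwiki-latest-pages-articles.xml.bz2")  # hypothetical local dump
model = Doc2Vec(dm=0, dbow_words=1, vector_size=300, window=8, min_count=19,
                epochs=10, workers=multiprocessing.cpu_count())   # DBOW, as in the paper
model.build_vocab(docs)
model.train(docs, total_examples=model.corpus_count, epochs=model.epochs)
model.save("wiki_dbow.d2v")   # hypothetical output path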

