This article explains how to export gensim doc2vec embedding weights for later use with a Keras embedding layer.

Problem description

I am fairly new to gensim and I am trying to solve a problem that involves using doc2vec embeddings in Keras. I was not able to find an existing implementation of doc2vec in Keras - in every example I have found so far, people simply use gensim to get the document embeddings.

Once I have trained my doc2vec model in gensim, I need to somehow export the embedding weights from gensim into Keras, and it is not really clear how to do that. I see that

model.syn0

supposedly gives the word2vec embedding weights (according to this). But it is unclear how to do the same export for the document embeddings. Any advice?

I know that in general I can just get the embedding for each document directly from the gensim model, but I want to fine-tune the embedding layer in Keras later on, since the doc embeddings will be used as part of a larger task, so they might be fine-tuned a bit.

Recommended answer

I figured it out.

Assuming you have already trained the gensim model and used string tags as document ids:

#get the vector of one document by its tag
model.docvecs['2017-06-24AEON']
#raw doc vectors (all of them)
model.docvecs.doctag_syn0
#doc tags, in the same order as the rows of doctag_syn0
model.docvecs.offset2doctag

You can export these doc vectors into a Keras embedding layer as shown below, assuming your DataFrame df holds all of the documents. Note that the embedding layer accepts only integers as inputs, so I use the row number in the dataframe as the id of the doc. Also note that the embedding layer reserves index 0 for masking, so when I pass a doc id as input to my network I need to ensure it is > 0.

import numpy as np

#creating the embedding matrix; row 0 is left all-zero for masking
embedding_matrix = np.zeros((len(df)+1, text_encode_dim))
for i, row in df.iterrows():
    embedding = modelDoc2Vec.docvecs[row['docCode']]
    embedding_matrix[i+1] = embedding
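The offset-by-one construction can be checked in isolation. Below is a minimal, self-contained sketch of the same idea, using a toy dictionary and a list of doc codes as stand-ins for modelDoc2Vec.docvecs and the DataFrame (the names docvecs and doc_codes here are illustrative, not part of the original code):

```python
import numpy as np

# toy stand-ins for the trained doc2vec vectors and the dataframe rows
docvecs = {
    'docA': np.array([1.0, 2.0, 3.0]),
    'docB': np.array([4.0, 5.0, 6.0]),
}
doc_codes = ['docA', 'docB']  # plays the role of df['docCode']
text_encode_dim = 3

# same construction: row 0 stays all-zero (reserved for masking),
# document i lands in row i + 1
embedding_matrix = np.zeros((len(doc_codes) + 1, text_encode_dim))
for i, code in enumerate(doc_codes):
    embedding_matrix[i + 1] = docvecs[code]
```

After this runs, embedding_matrix[0] is the all-zero masking row and embedding_matrix[1] holds the vector for 'docA', which is why every document id fed into the network must be its dataframe row number plus one.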

from keras.layers import Input, Embedding

#input with the id of the document
doc_input = Input(shape=(1,), dtype='int16', name='doc_input')
#embedding layer initialized with the matrix created earlier
embedded_doc_input = Embedding(output_dim=text_encode_dim, input_dim=len(df)+1, weights=[embedding_matrix], input_length=1, trainable=False)(doc_input)
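With trainable=False, the Embedding layer behaves as a frozen lookup table: feeding it a doc id i returns row i of the weight matrix. A numpy sketch of that lookup semantics (no Keras required; the matrix values here are made up for illustration):

```python
import numpy as np

text_encode_dim = 4
num_docs = 5

# a stand-in weight matrix, as would be built from the doc vectors;
# row 0 is the all-zero masking row, doc i occupies row i + 1
embedding_matrix = np.zeros((num_docs + 1, text_encode_dim))
for i in range(num_docs):
    embedding_matrix[i + 1] = np.full(text_encode_dim, float(i + 1))

def embedding_lookup(doc_ids):
    """What a frozen Embedding layer computes for integer inputs:
    a row lookup into its weight matrix."""
    return embedding_matrix[np.asarray(doc_ids)]

out = embedding_lookup([1, 3])  # vectors for docs in rows 1 and 3
```

This is why the layer only accepts integer ids, and why passing id 0 would silently return the masking row instead of a real document vector.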

UPDATE

After late 2017, with the introduction of the Keras 2.0 API, the very last line should be changed to:

from keras.initializers import Constant

embedded_doc_input = Embedding(output_dim=text_encode_dim, input_dim=len(df)+1, embeddings_initializer=Constant(embedding_matrix), input_length=1, trainable=False)(doc_input)

