This article describes how to keep only a minor subset of word vectors from a large pretrained Gensim Word2Vec model, which may be a useful reference for anyone facing the same problem.

Problem Description

I have a large pretrained Word2Vec model in gensim, from which I want to use the pretrained word vectors for an embedding layer in my Keras model.

The problem is that the embedding is enormous and I don't need most of the word vectors (because I know which words can occur as input). So I want to get rid of them to reduce the size of my embedding layer.

Is there a way to keep just the desired word vectors (including the corresponding indices!), based on a whitelist of words?

Recommended Answer

Thanks to this answer (I've changed the code a little bit to make it better), you can use the code below to solve your problem.

We have our minor set of words in restricted_word_set (it can be either a list or a set) and w2v is our model, so here is the function:

import numpy as np

# Note: this targets the gensim 3.x KeyedVectors API (vocab, index2entity,
# vectors_norm); gensim 4.x renamed or removed these attributes.
def restrict_w2v(w2v, restricted_word_set):
    new_vectors = []
    new_vocab = {}
    new_index2entity = []
    new_vectors_norm = []

    # Caution: w2v.vectors_norm is only populated after init_sims()
    # or a similarity query such as most_similar().
    for i in range(len(w2v.vocab)):
        word = w2v.index2entity[i]
        vec = w2v.vectors[i]
        vocab = w2v.vocab[word]
        vec_norm = w2v.vectors_norm[i]
        if word in restricted_word_set:
            # Remap the word's index into the reduced vocabulary.
            vocab.index = len(new_index2entity)
            new_index2entity.append(word)
            new_vocab[word] = vocab
            new_vectors.append(vec)
            new_vectors_norm.append(vec_norm)

    w2v.vocab = new_vocab
    w2v.vectors = np.array(new_vectors)
    w2v.index2entity = np.array(new_index2entity)
    w2v.index2word = np.array(new_index2entity)
    w2v.vectors_norm = np.array(new_vectors_norm)

It rewrites the word-related attributes of the model based on Word2VecKeyedVectors.
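To sanity-check the remapping logic without downloading a real model, restrict_w2v can be driven with minimal stand-in objects; the FakeVocab and FakeW2V classes below are ad-hoc mocks carrying only the attributes the function touches, not real gensim types:

```python
import numpy as np

# restrict_w2v as defined above:
def restrict_w2v(w2v, restricted_word_set):
    new_vectors = []
    new_vocab = {}
    new_index2entity = []
    new_vectors_norm = []

    for i in range(len(w2v.vocab)):
        word = w2v.index2entity[i]
        vec = w2v.vectors[i]
        vocab = w2v.vocab[word]
        vec_norm = w2v.vectors_norm[i]
        if word in restricted_word_set:
            vocab.index = len(new_index2entity)
            new_index2entity.append(word)
            new_vocab[word] = vocab
            new_vectors.append(vec)
            new_vectors_norm.append(vec_norm)

    w2v.vocab = new_vocab
    w2v.vectors = np.array(new_vectors)
    w2v.index2entity = np.array(new_index2entity)
    w2v.index2word = np.array(new_index2entity)
    w2v.vectors_norm = np.array(new_vectors_norm)

# Ad-hoc mocks (not gensim classes) with just the needed attributes:
class FakeVocab:
    def __init__(self, index):
        self.index = index

class FakeW2V:
    pass

words = ["beer", "wine", "cat"]
w2v = FakeW2V()
w2v.index2entity = list(words)
w2v.vocab = {w: FakeVocab(i) for i, w in enumerate(words)}
w2v.vectors = np.eye(3, dtype=np.float32)
w2v.vectors_norm = np.eye(3, dtype=np.float32)

restrict_w2v(w2v, {"beer", "cat"})

print([str(w) for w in w2v.index2entity])  # ['beer', 'cat']
print(w2v.vocab["cat"].index)              # 1 (remapped from 2)
print(w2v.vectors.shape)                   # (2, 3)
```

Note how "cat" keeps its vector but gets a new index (1 instead of 2), so indices and vectors stay consistent after the restriction.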

Usage:

from gensim.models import KeyedVectors

w2v = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin.gz", binary=True)
w2v.most_similar("beer")  # this call also populates w2v.vectors_norm, which restrict_w2v needs
restricted_word_set = {"beer", "wine", "computer", "python", "bash", "lagers"}
restrict_w2v(w2v, restricted_word_set)
w2v.most_similar("beer")

It can also be used to remove some words.
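To connect this back to the Keras embedding layer from the question: after restriction, w2v.vectors is exactly the weight matrix for the layer, and index2entity maps each row to its word. A minimal sketch follows, with small random stand-in data in place of a real model; the word2index name is illustrative, and the commented-out Embedding call assumes tf.keras:

```python
import numpy as np

# Stand-ins for the restricted model's attributes (a real model would
# provide w2v.vectors and w2v.index2entity after restrict_w2v):
vectors = np.random.rand(4, 300).astype(np.float32)
index2entity = ["beer", "wine", "python", "bash"]

# Map each word to its row in the embedding matrix:
word2index = {w: i for i, w in enumerate(index2entity)}

# Encode a sentence as integer indices for the embedding layer:
sentence = ["python", "beer"]
indices = [word2index[w] for w in sentence]
print(indices)  # [2, 0]

# With tf.keras, the matrix can then seed a frozen Embedding layer:
# from tensorflow.keras.layers import Embedding
# emb = Embedding(input_dim=vectors.shape[0],
#                 output_dim=vectors.shape[1],
#                 weights=[vectors],
#                 trainable=False)
```

Because restrict_w2v rewrites the indices to match the reduced vocabulary, the row order of w2v.vectors and the indices produced this way stay in sync.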

This concludes the article on selecting a minor set of word vectors from a pretrained Gensim Word2Vec model; hopefully the recommended answer above helps.
