This article explains how to ensure that gensim produces the same Word2Vec model across different runs on the same data; hopefully it is a useful reference for anyone running into the same problem.

Problem description

> In "LDA model generates different topics every time I train on the same corpus", by setting np.random.seed(0), the LDA model will always be initialized and trained in exactly the same way.

Is it the same for the Word2Vec models from gensim? By setting the random seed to a constant, would different runs on the same dataset produce the same model?

But strangely, it is already giving me the same vectors across different instances:

>>> from nltk.corpus import brown
>>> from gensim.models import Word2Vec
>>> sentences = brown.sents()[:100]
>>> word0 = sentences[0][0]
>>> model = Word2Vec(sentences, size=10, window=5, min_count=5, workers=4)
>>> model[word0]
array([ 0.04985042,  0.02882229, -0.03625415, -0.03165979,  0.06049283,
        0.01207791,  0.04722737,  0.01984878, -0.03026265,  0.04485954], dtype=float32)
>>> model = Word2Vec(sentences, size=10, window=5, min_count=5, workers=4)
>>> model[word0]
array([ 0.04985042,  0.02882229, -0.03625415, -0.03165979,  0.06049283,
        0.01207791,  0.04722737,  0.01984878, -0.03026265,  0.04485954], dtype=float32)
>>> model = Word2Vec(sentences, size=20, window=5, min_count=5, workers=4)
>>> model[word0]
array([ 0.02596745,  0.01475067, -0.01839622, -0.01587902,  0.03079717,
        0.00586761,  0.02367715,  0.00930568, -0.01521437,  0.02213679,
        0.01043982, -0.00625582,  0.00173071, -0.00235749,  0.01309298,
        0.00710233, -0.02270884, -0.01477827,  0.01166443,  0.00283862], dtype=float32)
>>> model = Word2Vec(sentences, size=20, window=5, min_count=5, workers=4)
>>> model[word0]
array([ 0.02596745,  0.01475067, -0.01839622, -0.01587902,  0.03079717,
        0.00586761,  0.02367715,  0.00930568, -0.01521437,  0.02213679,
        0.01043982, -0.00625582,  0.00173071, -0.00235749,  0.01309298,
        0.00710233, -0.02270884, -0.01477827,  0.01166443,  0.00283862], dtype=float32)
>>> exit()
alvas@ubi:~$ python
Python 2.7.11 (default, Dec 15 2015, 16:46:19)
[GCC 4.8.4] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from nltk.corpus import brown
>>> from gensim.models import Word2Vec
>>> sentences = brown.sents()[:100]
>>> model = Word2Vec(sentences, size=10, window=5, min_count=5, workers=4)
>>> word0 = sentences[0][0]
>>> model[word0]
array([ 0.04985042,  0.02882229, -0.03625415, -0.03165979,  0.06049283,
        0.01207791,  0.04722737,  0.01984878, -0.03026265,  0.04485954], dtype=float32)
>>> model = Word2Vec(sentences, size=20, window=5, min_count=5, workers=4)
>>> model[word0]
array([ 0.02596745,  0.01475067, -0.01839622, -0.01587902,  0.03079717,
        0.00586761,  0.02367715,  0.00930568, -0.01521437,  0.02213679,
        0.01043982, -0.00625582,  0.00173071, -0.00235749,  0.01309298,
        0.00710233, -0.02270884, -0.01477827,  0.01166443,  0.00283862], dtype=float32)

Is it true, then, that the default random seed is fixed? If so, what is the default random seed value? Or is it because I'm testing on a small dataset?

If it is true that the random seed is fixed and different runs on the same data return the same vectors, a link to the canonical code or documentation would be much appreciated.

Recommended answer

Yes, the default random seed is fixed to 1, as described by the author at https://radimrehurek.com/gensim/models/word2vec.html. The vector for each word is initialised using a hash of the concatenation of word + str(seed).
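A rough sketch of that initialisation scheme follows; it only illustrates the idea described in the documentation, and the function name, the 32-bit masking, and the vector scaling below are my assumptions rather than gensim's exact source:

import numpy as np

def seeded_vector(word, seed=1, size=10, hashfxn=hash):
    # Seed a private RandomState from hash(word + str(seed)); the same word,
    # seed and hash function therefore always yield the same initial vector.
    rng = np.random.RandomState(hashfxn(word + str(seed)) & 0xffffffff)
    return (rng.rand(size) - 0.5) / size

# Repeated calls with identical inputs produce identical initial vectors.
print(seeded_vector("The", seed=1, size=10))
print(seeded_vector("The", seed=1, size=10))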

The hashing function used, however, is Python's basic built-in hash function, which can produce different results when two machines differ in any of the following (a small demonstration follows the list):

  • 32- vs 64-bit builds (reference)
  • Python versions (reference)
  • operating systems / interpreters (reference1, reference2)
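For instance, on Python 3.3 and later, string hashing is randomised per interpreter process unless PYTHONHASHSEED is set, so the same word can hash differently from one run to the next. A minimal check, assuming a Python 3 interpreter (the Python 2.7 session above does not randomise string hashes by default):

import os, subprocess, sys

code = 'print(hash("The"))'

# Two fresh interpreters: on Python 3.3+ these usually print different values.
for _ in range(2):
    print(subprocess.check_output([sys.executable, "-c", code]))

# Pinning PYTHONHASHSEED makes the built-in hash repeatable across runs.
env = dict(os.environ, PYTHONHASHSEED="0")
for _ in range(2):
    print(subprocess.check_output([sys.executable, "-c", code], env=env))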

The above list is not exhaustive. Does it cover your question, though?

Edit

If you want to ensure consistency, you can provide your own hashing function as an argument to word2vec (the hashfxn parameter).

A very simple (and bad) example would be:

def hash(astring):
    # Deterministic everywhere, but a terrible hash (it only looks at the
    # first character), hence the "bad": many words collide on the same value.
    return ord(astring[0])

model = Word2Vec(sentences, size=10, window=5, min_count=5, workers=4, hashfxn=hash)

print model[sentences[0][0]]
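Beyond pinning the hash, gensim's documentation also notes that a fully deterministically reproducible run additionally needs a single worker thread (workers=1), since multi-threaded training introduces ordering jitter from OS scheduling. A hedged sketch combining the two, written against the same old-style API used in the question (stable_hash and the parameter values are illustrative only):

from nltk.corpus import brown
from gensim.models import Word2Vec

sentences = brown.sents()[:100]

def stable_hash(astring):
    # Deterministic across interpreters and platforms, unlike built-in hash().
    return sum(ord(ch) for ch in astring)

# seed fixes the RNG, workers=1 removes thread-ordering jitter, and the custom
# hashfxn keeps the per-word initialisation identical across machines.
model_a = Word2Vec(sentences, size=10, window=5, min_count=5,
                   seed=1, workers=1, hashfxn=stable_hash)
model_b = Word2Vec(sentences, size=10, window=5, min_count=5,
                   seed=1, workers=1, hashfxn=stable_hash)

# The two vectors below should now match element for element.
print model_a[sentences[0][0]]
print model_b[sentences[0][0]]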

This concludes the discussion of how to ensure that gensim produces the same Word2Vec model across different runs on the same data; hopefully the answer above is helpful.
