What do epochs mean in Doc2Vec and in train() when I have to manually run the iteration?

Problem description

I am trying to understand the epochs parameter in the Doc2Vec constructor and the epochs parameter in the train() function.

In the following code snippet, I manually set up a loop of 4000 iterations. Is that required, or is passing 4000 as the epochs parameter to Doc2Vec enough? Also, how does epochs in Doc2Vec differ from epochs in train()?

documents = Documents(train_set)

model = Doc2Vec(vector_size=100, dbow_words=1, dm=0, epochs=4000,  window=5,
                seed=1337, min_count=5, workers=4, alpha=0.001, min_alpha=0.025)

model.build_vocab(documents)

for epoch in range(model.epochs):
    print("epoch "+str(epoch))
    model.train(documents, total_examples=total_length, epochs=1)
    ckpnt = model_name+"_epoch_"+str(epoch)
    model.save(ckpnt)
    print("Saving {}".format(ckpnt))

Also, how and when are the weights updated?

Answer

You don't have to manually run the iterations, and you shouldn't call train() more than once unless you're an expert with a very specific reason for doing so. If you saw this technique in an online example you're copying, that example is likely outdated and misleading.

Call train() once, with your preferred number of passes as the epochs parameter.

Also, don't use a low starting alpha learning rate (0.001) that then rises to a min_alpha value 25 times larger (0.025). That's not how the learning-rate schedule is supposed to work, and most users shouldn't need to adjust the alpha-related defaults at all. (Again, if you got this from an online example somewhere, it's a bad example; let the authors know they're giving bad advice.)
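To see why the rate should fall rather than rise, here is an illustrative linear decay schedule of the kind gensim applies across a full training run (the starting values 0.025 and 0.0001 are gensim's usual defaults; the 20-epoch figure is an assumption):

```python
# Illustrative linear learning-rate decay: alpha falls from its
# starting value toward min_alpha over the whole training run.
alpha, min_alpha, epochs = 0.025, 0.0001, 20

schedule = [alpha - (alpha - min_alpha) * e / epochs
            for e in range(epochs)]

# Early passes use the largest steps; later passes fine-tune with
# ever-smaller ones. Reversing alpha and min_alpha (as in the
# question's code) would make the steps grow instead of shrink.
```

The questioner's settings (alpha=0.001, min_alpha=0.025) invert this, so each successive pass would take larger, more destabilizing steps.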

Also, 4000 training epochs is absurdly large. Values of 10-20 are common in published work when dealing with tens of thousands to millions of documents. If your dataset is smaller, it may not work well with Doc2Vec at all, though sometimes more epochs (or a smaller vector_size) can still learn something generalizable from tiny data; even then, expect to use on the order of dozens of epochs, not thousands.

A good introduction (albeit with a tiny dataset that barely works with Doc2Vec) is the doc2vec-lee.ipynb Jupyter notebook that's bundled with gensim, also viewable online at:

https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/doc2vec-lee.ipynb

Good luck!

