Doc2vec: TaggedLineDocument()

Question

I'm trying to learn and understand Doc2Vec, and I'm following this tutorial. My input is a list of documents, i.e. a list of lists of words. This is what my code looks like:

    input = [["word1","word2",..."wordn"],["word1","word2",..."wordn"],...]

    documents = TaggedLineDocument(input)

    model = doc2vec.Doc2Vec(documents,size = 50, window = 10, min_count = 2, workers=2)

But I am getting what looks like a unicode error (I tried googling it, but found nothing useful):

    TypeError('don\'t know how to handle uri %s' % repr(uri))

Can somebody please help me understand where I am going wrong? Thank you!

Recommended answer

TaggedLineDocument should be instantiated with a file path, not with an in-memory list. Make sure the file is formatted so that each line holds exactly one document, with its words separated by whitespace.

    documents = TaggedLineDocument('myfile.txt')
    documents = TaggedLineDocument('compressed_text.txt.gz')
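If the documents already live in memory as lists of words, as in the question, one way to get them into the expected format is to write them out first. The following is only a sketch under stated assumptions: the helper names (`write_line_documents`, `load_tagged_lines`, `tag_in_memory`), the sample data, and the `corpus.txt` file name are all made up for illustration, and the gensim imports are done lazily so the file-writing part runs without gensim installed.

```python
import os
import tempfile


def write_line_documents(docs, path):
    """Write tokenized documents to `path`, one document per line."""
    with open(path, "w", encoding="utf-8") as fh:
        for tokens in docs:
            # Tokens are joined with single spaces; TaggedLineDocument
            # splits each line on whitespace when it reads the file back.
            fh.write(" ".join(tokens) + "\n")
    return path


def load_tagged_lines(path):
    """Wrap the file for Doc2Vec (needs gensim; imported lazily)."""
    from gensim.models.doc2vec import TaggedLineDocument
    return TaggedLineDocument(path)


def tag_in_memory(docs):
    """Alternative that skips the file: tag each token list by its index."""
    from gensim.models.doc2vec import TaggedDocument
    return [TaggedDocument(words=tokens, tags=[i]) for i, tokens in enumerate(docs)]


# Hypothetical sample data standing in for the question's `input` list.
sample_docs = [["word1", "word2", "word3"], ["word4", "word5"]]
corpus_path = write_line_documents(
    sample_docs, os.path.join(tempfile.mkdtemp(), "corpus.txt")
)
```

With gensim installed, either `load_tagged_lines(corpus_path)` or `tag_in_memory(sample_docs)` should be usable as the `documents` argument to `Doc2Vec`. Note that recent gensim releases name the `size` argument `vector_size`.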

From the source code:

The uri (the thing you are instantiating TaggedLineDocument with) can be either:

1. a URI for the local filesystem (compressed `.gz` or `.bz2` files handled automatically):
   `./lines.txt`, `/home/joe/lines.txt.gz`, `file:///home/joe/lines.txt.bz2`
2. a URI for HDFS: `hdfs:///some/path/lines.txt`
3. a URI for Amazon's S3 (can also supply credentials inside the URI):
   `s3://my_bucket/lines.txt`, `s3://my_aws_key_id:key_secret@my_bucket/lines.txt`
4. an instance of the boto.s3.key.Key class.
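To see why passing a list triggers the error from the question, here is a rough illustration of that kind of dispatch. It is not the library's actual code: `classify_uri` is a hypothetical function written for this answer, it only covers the three string forms listed above (the boto key case is left out), and the error message is simply copied from the question.

```python
from urllib.parse import urlsplit

# Maps a URI scheme to the kind of storage from the list above.
# An empty scheme means a plain path such as ./lines.txt.
SCHEMES = {"": "local", "file": "local", "hdfs": "hdfs", "s3": "s3"}


def classify_uri(uri):
    """Illustrative only: say which kind of location `uri` points at."""
    if not isinstance(uri, str):
        # A list of token lists lands here, which mirrors the question's error.
        raise TypeError("don't know how to handle uri %s" % repr(uri))
    scheme = urlsplit(uri).scheme
    if scheme not in SCHEMES:
        raise NotImplementedError("unsupported scheme %r" % scheme)
    return SCHEMES[scheme]
```

Calling `classify_uri("/home/joe/lines.txt.gz")` returns `"local"`, while calling it with the question's list of lists raises the `TypeError`, since a list is neither a path nor a URI string.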

