This article explains how to read documents from files when using sklearn.feature_extraction.text's CountVectorizer.

Problem Description


I am able to use code as in the example from the documentation, where the input to the fit_transform() function is a list of sentences, i.e:

from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'this is the first document',
    'this is the second second document',
    'and the third one',
    'is this the first document?',
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)


and get expected data out. But when I try to replace corpus with a list of files, or file objects as the documentation suggests it can be:


"fit(raw_documents, y=None)

Learn a vocabulary dictionary of all tokens in the raw documents.

Parameters:
    raw_documents : iterable
        An iterable which yields either str, unicode or file objects.

Returns:
    self
"


... so there is something missing in my understanding of the pipeline, I think. Given a directory of files that I would like to CountVectorize, how do I do that? If I try to feed a list of file objects, as in [open(file, 'r')], the error message I get is that file objects have no lower function.

Recommended Answer


Set the vectorizer's input constructor parameter to either 'filename' or 'file'. Its default value is 'content', which assumes you have already read the documents into memory.
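A minimal sketch of the 'filename' mode. The temporary directory, file names, and sample documents below are purely illustrative; in practice you would point glob at your own directory of text files:

```python
import glob
import os
import tempfile

from sklearn.feature_extraction.text import CountVectorizer

# Build a small illustrative directory of text files.
tmpdir = tempfile.mkdtemp()
docs = ['this is the first document', 'and the third one']
for i, text in enumerate(docs):
    with open(os.path.join(tmpdir, 'doc%d.txt' % i), 'w') as f:
        f.write(text)

# input='filename' tells CountVectorizer that each item in the
# iterable is a path to a file to read, rather than the document
# content itself.
filenames = sorted(glob.glob(os.path.join(tmpdir, '*.txt')))
vectorizer = CountVectorizer(input='filename')
X = vectorizer.fit_transform(filenames)

print(sorted(vectorizer.vocabulary_))
# → ['and', 'document', 'first', 'is', 'one', 'the', 'third', 'this']
```

Alternatively, input='file' accepts already-opened file objects, e.g. vectorizer.fit_transform([open(p) for p in filenames]), which addresses the [open(file, 'r')] attempt from the question.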

