Question
I am able to use the code from the example in the documentation, where the input to the fit_transform() function is a list of sentences, i.e.:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
corpus = [
    'this is the first document',
    'this is the second second document',
    'and the third one',
    'is this the first document?',
]
X = vectorizer.fit_transform(corpus)
and get the expected data out. But when I try to replace the corpus with a list of files, or with file objects, as the documentation suggests it can be:
" fit(raw_documents, y=None)
Learn a vocabulary dictionary of all tokens in the raw documents.
Parameters :
raw_documents : iterable
An iterable which yields either str, unicode or file objects.
Returns :
self :
"
... so there is something missing in my understanding of the pipeline, I think. Given a directory of files that I would like to CountVectorize, how do I do that? If I try to feed it a list of file objects, as in [open(file, 'r')], the error message I get is that file objects have no 'lower' function.
Answer
Set the vectorizer's input constructor parameter to either 'filename' or 'file'. Its default value is 'content', which assumes you've already read the files into memory.
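A minimal sketch of this fix: the temporary directory and the doc*.txt file names below are only stand-ins for your own directory of files. With input='filename', fit_transform() takes a list of paths and opens each file itself.

```python
import os
import tempfile
from sklearn.feature_extraction.text import CountVectorizer

# Create a few sample documents on disk (stand-ins for your directory).
tmpdir = tempfile.mkdtemp()
texts = [
    'this is the first document',
    'this is the second second document',
    'and the third one',
]
paths = []
for i, text in enumerate(texts):
    path = os.path.join(tmpdir, 'doc%d.txt' % i)
    with open(path, 'w') as f:
        f.write(text)
    paths.append(path)

# input='filename' tells the vectorizer to open and read each path itself,
# instead of treating each list element as the document's text.
vectorizer = CountVectorizer(input='filename')
X = vectorizer.fit_transform(paths)
```

To collect the paths from an existing directory, something like `[os.path.join(d, name) for name in os.listdir(d)]` works; with input='file' you would pass already-opened file objects instead.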