How to vectorize the following list of lists with scikit-learn?

Problem description

I would like to use scikit-learn to vectorize a list that contains lists. I read the training texts from a directory and end up with something like this:

corpus = [["this is spam, 'SPAM'"],["this is ham, 'HAM'"],["this is nothing, 'NOTHING'"]]

from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer(analyzer='word')
vect_representation = vect.fit_transform(corpus)
print(vect_representation.toarray())

I get the following error:

return lambda x: strip_accents(x.lower())
AttributeError: 'list' object has no attribute 'lower'
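
The traceback arises because CountVectorizer expects an iterable of strings, while each element of corpus here is itself a list. A minimal sketch, assuming every inner list holds exactly one string (the name flat_corpus is introduced only for illustration), flattens the corpus before vectorizing:

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [["this is spam, 'SPAM'"], ["this is ham, 'HAM'"], ["this is nothing, 'NOTHING'"]]

# Assumption: each inner list contains exactly one string, so taking
# element 0 turns the list of lists into a plain list of documents.
flat_corpus = [doc[0] for doc in corpus]

vect = CountVectorizer(analyzer='word')
vect_representation = vect.fit_transform(flat_corpus)  # no AttributeError now
print(vect_representation.toarray())
```

Note that with this approach the label text at the end of each string is still tokenized and counted as an ordinary word, so the labels leak into the features.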

Another problem is the label at the end of each document: how should I treat the labels in order to classify correctly?
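
One possible way to handle the labels is to separate them from the text before vectorizing, so that the text becomes the feature matrix and the labels become the target. The sketch below is not taken from the answer; it assumes every document has the form "<text>, '<LABEL>'" with the label after the last comma, and split_text_and_label is a hypothetical helper:

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [["this is spam, 'SPAM'"], ["this is ham, 'HAM'"], ["this is nothing, 'NOTHING'"]]

def split_text_and_label(doc):
    """Hypothetical helper: split "<text>, '<LABEL>'" into (text, label)."""
    # Assumption: the label follows the last comma and is wrapped in single quotes.
    text, label = doc[0].rsplit(",", 1)
    return text.strip(), label.strip().strip("'")

pairs = [split_text_and_label(doc) for doc in corpus]
texts = [text for text, _ in pairs]
labels = [label for _, label in pairs]

# Only the text is vectorized; the labels stay out of the feature matrix.
X = CountVectorizer().fit_transform(texts)
print(X.toarray())
print(labels)
```

X and labels could then be passed as the features and target of a scikit-learn classifier.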

Recommended answer

For anyone reading this in the future, this solved my problem:

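corpus = [["this is spam, 'SPAM'"],["this is ham, 'HAM'"],["this is nothing, 'NOTHING'"]]

from sklearn.feature_extraction.text import CountVectorizer

# Assumption: the original answer never defines splited_labels_from_corpus.
# The value below (one token list per document, holding only its label)
# is a guess that reproduces the output shown.
splited_labels_from_corpus = [['SPAM'], ['HAM'], ['NOTHING']]

# The identity tokenizer accepts pre-tokenized lists; lowercase=False skips str.lower().
bag_of_words = CountVectorizer(tokenizer=lambda doc: doc, lowercase=False).fit_transform(splited_labels_from_corpus)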

This is the output when I call .toarray() on the result:

[[0 0 1]
 [1 0 0]
 [0 1 0]]
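
To see which column corresponds to which token, the fitted vectorizer's vocabulary_ attribute can be inspected. The sketch below assumes, as a guess consistent with the matrix above, that splited_labels_from_corpus is a list of token lists holding one label per document; the answer itself does not show how that variable was built.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Assumption: not defined in the answer; chosen to be consistent with its output.
splited_labels_from_corpus = [['SPAM'], ['HAM'], ['NOTHING']]

# Passing the token lists through unchanged avoids calling .lower() on a list.
vect = CountVectorizer(tokenizer=lambda doc: doc, lowercase=False)
bag_of_words = vect.fit_transform(splited_labels_from_corpus)

# vocabulary_ maps each token to its column index (columns are sorted by token).
for token, column in sorted(vect.vocabulary_.items(), key=lambda item: item[1]):
    print(column, token)

print(bag_of_words.toarray())
```

Each row then contains a single 1 in the column of that document's token, which is why the matrix looks like a one-hot encoding.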

Thanks, everyone.


10-17 01:09