sklearn: vectorizing in cross validation for text classification

Problem description

I have a question about using cross validation for text classification in sklearn. Vectorizing all the data before cross validation is problematic, because the classifier would then have "seen" the vocabulary that occurs in the test data. Weka has the FilteredClassifier to solve this problem. What is the sklearn equivalent? What I mean is that the feature set should be different for each fold, because the training data are different.

Solution

The scikit-learn solution to this problem is to cross-validate a Pipeline of estimators, e.g.:

>>> # sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection
>>> from sklearn.model_selection import cross_val_score
>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> from sklearn.pipeline import Pipeline
>>> from sklearn.svm import LinearSVC
>>> clf = Pipeline([('vect', TfidfVectorizer()), ('svm', LinearSVC())])

clf is now a composite estimator that does both feature extraction and SVM model fitting. Given a list of documents (i.e. an ordinary Python list of strings) named documents and their labels y, calling

>>> cross_val_score(clf, documents, y)

will do feature extraction in each fold separately, so that each of the SVMs knows only the vocabulary of its own training set of (k-1) folds.
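One way to check this behaviour is to keep the pipeline fitted in each fold and inspect its vocabulary. The sketch below is only an illustration, not part of the original answer: the toy corpus, the labels and the choice of 4 stratified folds are made up, and it uses cross_validate with return_estimator=True instead of cross_val_score so that the per-fold pipelines stay accessible.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Toy corpus (invented for illustration): every document ends with a word
# that occurs nowhere else, so the vocabulary depends on the training split.
documents = [
    "cheap pills offer alpha",
    "cheap loans offer bravo",
    "win cash offer charlie",
    "free pills cash delta",
    "meeting agenda notes echo",
    "project meeting notes foxtrot",
    "agenda review project golf",
    "review notes project hotel",
]
y = [1, 1, 1, 1, 0, 0, 0, 0]

clf = Pipeline([("vect", TfidfVectorizer()), ("svm", LinearSVC())])

# return_estimator=True keeps the pipeline that was fitted on each training split.
cv = StratifiedKFold(n_splits=4)
results = cross_validate(clf, documents, y, cv=cv, return_estimator=True)

# Vocabulary learned by the vectorizer inside each per-fold pipeline.
fold_vocabularies = [
    set(est.named_steps["vect"].vocabulary_) for est in results["estimator"]
]

# Vocabulary obtained by vectorizing all documents up front (the leaky approach).
full_vocabulary = set(TfidfVectorizer().fit(documents).vocabulary_)

for i, vocab in enumerate(fold_vocabularies):
    missing = sorted(full_vocabulary - vocab)
    print(f"fold {i}: {len(vocab)} terms, not seen during training: {missing}")
```

Each per-fold vocabulary should lack the words that occur only in that fold's held-out documents, which is the separation the question asks for.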