This article discusses whether it makes sense to use both CountVectorizer and TfidfVectorizer as feature vectors for text clustering with KMeans.

Problem description

I am trying to build out my feature vectors from my CSV file, which contains about 1000 comments. One of my feature vectors is tfidf, using scikit-learn's TfidfVectorizer. Does it make sense to also use counts as a feature vector, or is there a better feature vector that I should use?

And if I do end up using both CountVectorizer and TfidfVectorizer as my features, how should I fit them both into my KMeans model (specifically the km.fit() part)? For now I am only able to fit the tfidf feature vectors into the model.

Here is my code:

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

vectorizer = TfidfVectorizer(min_df=1, max_df=0.9, stop_words='english', decode_error='ignore')
vectorized = vectorizer.fit_transform(sentence_list)

#count_vectorizer = CountVectorizer(min_df=1, max_df=0.9, stop_words='english', decode_error='ignore')
#count_vectorized = count_vectorizer.fit_transform(sentence_list)

km = KMeans(n_clusters=num_clusters, init='k-means++', n_init=10, verbose=1)
km.fit(vectorized)

Recommended answer

Essentially what you are doing is finding a numeric representation of your text documents (feature engineering). In some problems the counts work better, and in others the tfidf representation is the best choice. You should really try them both. While the two representations are very similar and therefore carry approximately the same information, it could be the case that you get better precision by using the full set of features (tfidf + counts). By searching in this larger feature space, it is possible to get closer to the true model.

This is how you can horizontally stack your features:

import scipy.sparse

X = scipy.sparse.hstack([vectorized, count_vectorized])

Then you can do:

km.fit(X)  # KMeans is unsupervised, so no labels y are passed
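Putting the pieces together, here is a minimal end-to-end sketch of the approach described above. The tiny `sentence_list` is made-up illustration data (the asker's real corpus has about 1000 comments), and `n_clusters=2` is an arbitrary choice for this toy example:

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import scipy.sparse

# Toy corpus standing in for the asker's ~1000 CSV comments
sentence_list = [
    "great product, works well",
    "terrible product, broke fast",
    "works great, very happy",
    "broke after one day, terrible",
]

tfidf_vec = TfidfVectorizer(stop_words='english')
count_vec = CountVectorizer(stop_words='english')

X_tfidf = tfidf_vec.fit_transform(sentence_list)
X_count = count_vec.fit_transform(sentence_list)

# Stack horizontally: each document keeps one row,
# and the tfidf and count features are concatenated column-wise.
X = scipy.sparse.hstack([X_tfidf, X_count]).tocsr()

km = KMeans(n_clusters=2, init='k-means++', n_init=10, random_state=0)
labels = km.fit_predict(X)

print(X.shape)   # one row per document, tfidf + count columns combined
print(labels)    # one cluster label per document
```

Note the `.tocsr()` call: `scipy.sparse.hstack` returns a COO matrix, and converting to CSR is the format scikit-learn estimators work with internally for row-wise access.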
