Question
I need to process more than 1,000,000 text records, and I am using CountVectorizer to transform my data. I have the following code.
from sklearn.feature_extraction.text import CountVectorizer

TEXT = [data[i].values()[3] for i in range(len(data))]  # these are the text records
vectorizer = CountVectorizer(min_df=1)
X = vectorizer.fit_transform(TEXT)
X_list = X.toarray().tolist()
When I run this code, it raises a MemoryError. The text records I have are mostly short paragraphs (~100 words). Vectorization seems to be very expensive.
Update
I added more constraints to CountVectorizer but still got a MemoryError. The length of feature_names is 2391.
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(min_df=0.003, max_df=3.05, lowercase=True, stop_words='english')
X = vectorizer.fit_transform(TEXT)
feature_names = vectorizer.get_feature_names()
X_tolist = X.toarray().tolist()
Traceback (most recent call last):
File "nlp2.py", line 42, in <module>
X_tolist = X.toarray().tolist()
File "/opt/conda/lib/python2.7/site-packages/scipy/sparse/compressed.py", line 940, in toarray
return self.tocoo(copy=False).toarray(order=order, out=out)
File "/opt/conda/lib/python2.7/site-packages/scipy/sparse/coo.py", line 250, in toarray
B = self._process_toarray_args(order, out)
File "/opt/conda/lib/python2.7/site-packages/scipy/sparse/base.py", line 817, in _process_toarray_args
return np.zeros(self.shape, dtype=self.dtype, order=order)
MemoryError
Why is this happening, and how can I get around it? Thank you!
Answer
Your problem is that X is a sparse matrix, with one row per document recording which words occur in that document. If you have a million documents and 2391 distinct words in total (the length of feature_names given in your question), the dense version of X would have about 2.4 billion entries, easily enough to cause a MemoryError.
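The arithmetic can be sketched directly, using the figures from the question (1,000,000 documents, 2,391 features) and assuming 8 bytes per entry, since CountVectorizer's default dtype is int64 on most platforms:

```python
# Rough estimate of the memory a dense version of X would need,
# using the document and feature counts from the question.
n_docs = 1000000
n_features = 2391
bytes_per_entry = 8  # int64, CountVectorizer's default dtype

total_entries = n_docs * n_features
total_gib = total_entries * bytes_per_entry / 2.0**30
print(total_entries)        # 2391000000
print(round(total_gib, 1))  # about 17.8 GiB
```

Roughly 18 GiB just for the zero-filled array, which np.zeros must allocate up front; that is exactly the allocation failing in the traceback.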
The problem is with the line X_list = X.toarray().tolist(), which converts X to a dense array. You don't have enough memory for that, and there should be a way to do whatever you are trying to do without it, since the sparse version of X already contains all the information you need.