Question
I need to process more than 1,000,000 text records, and I am using CountVectorizer to transform my data. I have the following code.
from sklearn.feature_extraction.text import CountVectorizer

TEXT = [data[i].values()[3] for i in range(len(data))]  # these are the text records
vectorizer = CountVectorizer(min_df=1)
X = vectorizer.fit_transform(TEXT)
X_list = X.toarray().tolist()
When I run this code, it raises a MemoryError. The text records I have are mostly short paragraphs (~100 words). Vectorization seems to be very expensive.
Update
I added more constraints to CountVectorizer but still got a MemoryError. The length of feature_names is 2391.
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(min_df=0.003, max_df=3.05, lowercase=True, stop_words='english')
X = vectorizer.fit_transform(TEXT)
feature_names = vectorizer.get_feature_names()
X_tolist = X.toarray().tolist()
Traceback (most recent call last):
File "nlp2.py", line 42, in <module>
X_tolist = X.toarray().tolist()
File "/opt/conda/lib/python2.7/site-packages/scipy/sparse/compressed.py", line 940, in toarray
return self.tocoo(copy=False).toarray(order=order, out=out)
File "/opt/conda/lib/python2.7/site-packages/scipy/sparse/coo.py", line 250, in toarray
B = self._process_toarray_args(order, out)
File "/opt/conda/lib/python2.7/site-packages/scipy/sparse/base.py", line 817, in _process_toarray_args
return np.zeros(self.shape, dtype=self.dtype, order=order)
MemoryError
Why is this happening, and how can I get around it? Thank you!
Answer
Your problem is that X is a sparse matrix, with one row per document recording which words occur in that document. If you have a million documents and 2391 distinct words in total (the length of feature_names given in your question), the dense version of X would have about 2.4 billion entries, easily enough to cause a MemoryError.
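The arithmetic can be sketched directly, using the figures from the question (1,000,000 documents, 2,391 features) and assuming 8 bytes per entry, since CountVectorizer's default dtype is int64 on most platforms:

```python
# Rough estimate of the memory a dense version of X would need,
# using the document and feature counts from the question.
n_docs = 1000000
n_features = 2391
bytes_per_entry = 8  # int64, CountVectorizer's default dtype

total_entries = n_docs * n_features
total_gib = total_entries * bytes_per_entry / 2.0**30
print(total_entries)        # 2391000000
print(round(total_gib, 1))  # about 17.8 GiB
```

Roughly 18 GiB just for the zero-filled array, which np.zeros must allocate up front; that is exactly the allocation failing in the traceback.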
The problem is with the line X_list = X.toarray().tolist(), which converts X to a dense array. You don't have enough memory for that, and there should be a way to do whatever you are trying to do without it, since the sparse version of X already contains all the information you need.