This article looks at the question "Vectorization in sklearn seems to be very memory-intensive. Why?" and walks through a recommended answer that may be useful to readers facing the same problem.

Problem Description

I need to process more than 1,000,000 text records. I am using CountVectorizer to transform my data, with the following code.

TEXT = [data[i].values()[3] for i in range(len(data))]  # these are the text records

from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(min_df=1)
X = vectorizer.fit_transform(TEXT)


X_list = X.toarray().tolist()

When I run this code, it results in a MemoryError. The text records I have are mostly short paragraphs (~100 words). Vectorization seems to be very expensive.

Update

I added more constraints to CountVectorizer but still got a MemoryError. The length of feature_names is 2391.

from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(min_df=0.003, max_df=3.05, lowercase=True, stop_words='english')
X = vectorizer.fit_transform(TEXT)
feature_names = vectorizer.get_feature_names()

X_tolist = X.toarray().tolist()

Traceback (most recent call last):
File "nlp2.py", line 42, in <module>
X_tolist = X.toarray().tolist()
File "/opt/conda/lib/python2.7/site-packages/scipy/sparse/compressed.py", line 940, in toarray
return self.tocoo(copy=False).toarray(order=order, out=out)
File "/opt/conda/lib/python2.7/site-packages/scipy/sparse/coo.py", line 250, in toarray
B = self._process_toarray_args(order, out)
File "/opt/conda/lib/python2.7/site-packages/scipy/sparse/base.py", line 817, in _process_toarray_args
return np.zeros(self.shape, dtype=self.dtype, order=order)
MemoryError

Why is this happening, and how can I get around it? Thank you!

Recommended Answer

Your problem is that X is a sparse matrix with one row per document, recording which words are present in that document. If you have a million documents and 2391 distinct words in total (the length of feature_names given in your question), the dense version of X would hold roughly 2.4 billion entries, enough to cause a MemoryError.
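For a sense of scale, here is a rough back-of-the-envelope estimate; this is only a sketch, assuming CountVectorizer's default int64 dtype (8 bytes per entry):

n_docs = 1000000        # number of text records
n_features = 2391       # len(feature_names) from the question
bytes_per_entry = 8     # int64, CountVectorizer's default dtype

dense_bytes = n_docs * n_features * bytes_per_entry
print(dense_bytes / (1024.0 ** 3))  # roughly 18 GiB for the dense array alone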

The problem is the line X_list = X.toarray().tolist(), which converts X to a dense array. You don't have enough memory for that, and there should be a way to do what you are trying to do without it, since the sparse version of X already holds all the information you need.
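Most of what the dense list would be used for can usually be done on the sparse matrix directly, or one row at a time. A minimal sketch along those lines, reusing X and feature_names from the code above (what fits best depends on what X_list was actually needed for):

import numpy as np

# Total count of each word across the whole corpus, computed on the
# sparse matrix itself -- no toarray() needed.
total_counts = np.asarray(X.sum(axis=0)).ravel()

# If per-document values are needed, densify one row at a time instead
# of the whole matrix.
first_doc_counts = X[0].toarray().ravel()

# Or look only at the non-zero entries of a row via its COO form.
row = X[0].tocoo()
doc_word_counts = {feature_names[j]: int(v) for j, v in zip(row.col, row.data)}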

That concludes this look at "Vectorization in sklearn seems to be very memory-intensive. Why?"; hopefully the recommended answer above is helpful.
