Sklearn - cannot use encoded data in a Random Forest classifier

Problem description




I'm new to scikit-learn. I'm trying to use preprocessing.OneHotEncoder to encode my training and test data. After encoding, I tried to train a random forest classifier on that data, but I get the following error when fitting (here is the error trace):

    99         model.fit(X_train, y_train)
    100         preds = model.predict_proba(X_cv)[:, 1]
    101 

C:\Python27\lib\site-packages\sklearn\ensemble\forest.pyc in fit(self, X, y, sample_weight)
    288 
    289         # Precompute some data
--> 290         X, y = check_arrays(X, y, sparse_format="dense")
    291         if (getattr(X, "dtype", None) != DTYPE or
    292                 X.ndim != 2 or

C:\Python27\lib\site-packages\sklearn\utils\validation.pyc in check_arrays(*arrays, **options)
    200                     array = array.tocsc()
    201                 elif sparse_format == 'dense':
--> 202                     raise TypeError('A sparse matrix was passed, but dense '
    203                                     'data is required. Use X.toarray() to '
    204                                     'convert to a dense numpy array.')

TypeError: A sparse matrix was passed, but dense data is required. Use X.toarray() to convert to a dense numpy array.

I tried to convert the sparse matrix to dense using X.toarray() and X.todense(), but when I do that, I get the following error trace:

    99         model.fit(X_train.toarray(), y_train)
    100         preds = model.predict_proba(X_cv)[:, 1]
    101 

C:\Python27\lib\site-packages\scipy\sparse\compressed.pyc in toarray(self)
    548 
    549     def toarray(self):
--> 550         return self.tocoo(copy=False).toarray()
    551 
    552     ##############################################################

C:\Python27\lib\site-packages\scipy\sparse\coo.pyc in toarray(self)
    236 
    237     def toarray(self):
--> 238         B = np.zeros(self.shape, dtype=self.dtype)
    239         M,N = self.shape
    240         coo_todense(M, N, self.nnz, self.row, self.col, self.data, B.ravel())

ValueError: array is too big.

Can anyone help me fix this?

Thank you

Solution

sklearn random forests do not work on sparse input, and your dataset is too large and too sparse for a dense version to fit in memory.
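To see why the dense conversion fails, it helps to estimate the memory a densified one-hot matrix would need. The shape below is hypothetical, a stand-in for a dataset of this kind; the point is that rows × one-hot columns × 8 bytes quickly exceeds any machine's RAM:

```python
# Hypothetical shape for a one-hot encoded dataset; the real numbers
# would come from X_train.shape, but anything in this ballpark is fatal.
n_rows = 1_000_000         # training examples
n_onehot_cols = 500_000    # columns after one-hot encoding high-cardinality features

# np.zeros in the traceback allocates 8 bytes per float64 cell.
dense_bytes = n_rows * n_onehot_cols * 8
print(dense_bytes / 1024 ** 4)  # roughly 3.6 (TiB) for the dense array
```

(As an aside, newer scikit-learn releases do accept scipy sparse input in RandomForestClassifier.fit, so upgrading may remove the TypeError itself, though a matrix this wide is still worth pruning.)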

You probably have some categorical features with much too large a cardinality (for instance, a free-text field or unique entry IDs). Try dropping those features and starting over.
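One way to sketch that pruning in plain Python (the column names and the cardinality cap below are invented for illustration; a unique-ID column plays the role of the feature that blows up the encoding):

```python
# Hypothetical dataset as column name -> values; 'user_id' stands in for
# a unique-entry-ID feature whose one-hot encoding would be enormous.
data = {
    'color':   ['red', 'blue', 'red', 'green'],
    'user_id': ['u1', 'u2', 'u3', 'u4'],   # one distinct value per row
}

# Keep only columns whose cardinality stays under a chosen cap, so the
# one-hot encoded matrix retains a manageable number of columns.
max_cardinality = 3
keep = [name for name, values in data.items()
        if len(set(values)) <= max_cardinality]
print(keep)  # ['color']
```

With a pandas DataFrame, the same idea can be expressed by comparing df.nunique() against the cap.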

