This article looks at how to deal with nan/null values when using classifiers in scikit-learn. It may be a useful reference if you run into the same problem.

Problem description

I was wondering if there are classifiers in scikit-learn that handle nan/null values. I thought the random forest regressor handled this, but I got an error when I called predict.

import numpy as np
from sklearn.ensemble import RandomForestRegressor

X_train = np.array([[1, np.nan, 3], [np.nan, 5, 6]])
y_train = np.array([1, 2])
clf = RandomForestRegressor()
clf.fit(X_train, y_train)            # Already complains here: the estimator rejects NaN input
X_test = np.array([[7, 8, np.nan]])
y_pred = clf.predict(X_test)         # Fails!

Can I not call predict with any scikit-learn algorithm when there are missing values?

Edit: Now that I think about this, it makes sense. It's not an issue during training, but when you predict, how do you branch when the variable is null? Maybe you could just split both ways and average the result? It seems like k-NN should work fine as long as the distance function ignores nulls, though.
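
To make that k-NN idea concrete, here is a minimal sketch in plain NumPy, with made-up data and a hypothetical nan_ignoring_distance helper (this is not a built-in scikit-learn feature): the distance only compares the coordinates that are non-null in both points, and the prediction is the label of the nearest training point.

import numpy as np

def nan_ignoring_distance(a, b):
    # Compare only the coordinates that are non-null in both points
    mask = ~np.isnan(a) & ~np.isnan(b)
    if not mask.any():
        return np.inf  # nothing to compare on
    return np.sqrt(np.sum((a[mask] - b[mask]) ** 2))

X_train = np.array([[1.0, np.nan, 3.0], [np.nan, 5.0, 6.0]])
y_train = np.array([1, 2])
x_test = np.array([7.0, 8.0, np.nan])

# Brute-force 1-NN: predict the label of the closest training point
dists = [nan_ignoring_distance(x_test, x) for x in X_train]
print(y_train[int(np.argmin(dists))])  # -> 2 (closer to the second training point)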

Edit 2 (older and wiser me): Some GBM libraries (such as xgboost) use a ternary tree instead of a binary tree precisely for this purpose: 2 children for the yes/no decision and 1 child for the missing decision. sklearn uses a binary tree.
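
For completeness: scikit-learn later added a histogram-based gradient boosting estimator that, like xgboost, handles missing values natively (each split learns which child the NaN samples should go to), so no imputation is needed there. A minimal sketch, assuming a recent scikit-learn (1.0 or later) where HistGradientBoostingClassifier is stable in sklearn.ensemble; the toy data is made up:

import numpy as np
from sklearn.ensemble import HistGradientBoostingClassifier

X_train = np.array([[0, 0, np.nan], [np.nan, 1, 1], [0, 0, 1], [1, 1, np.nan]])
y_train = np.array([0, 1, 0, 1])

# NaN is accepted directly, both at fit time and at predict time
clf = HistGradientBoostingClassifier().fit(X_train, y_train)
print(clf.predict(np.array([[np.nan, 1, 1]])))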

Recommended answer

I made an example that contains missing values in both the training and the test sets.

I just picked a strategy that replaces missing data with the mean, using the SimpleImputer class. There are other strategies (see the note after the example).

from __future__ import print_function

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer


X_train = [[0, 0, np.nan], [np.nan, 1, 1]]
Y_train = [0, 1]
X_test_1 = [0, 0, np.nan]
X_test_2 = [0, np.nan, np.nan]
X_test_3 = [np.nan, 1, 1]

# Create an imputer that replaces missing values with the column mean
imp = SimpleImputer(missing_values=np.nan, strategy='mean')
imp = imp.fit(X_train)

# Impute our data, then train
X_train_imp = imp.transform(X_train)
clf = RandomForestClassifier(n_estimators=10)
clf = clf.fit(X_train_imp, Y_train)

for X_test in [X_test_1, X_test_2, X_test_3]:
    # Impute each test item, then predict
    X_test_imp = imp.transform([X_test])  # transform expects a 2D array, so wrap the single sample
    print(X_test, '->', clf.predict(X_test_imp))

# Results
[0, 0, nan] -> [0]
[0, nan, nan] -> [0]
[nan, 1, 1] -> [1]
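
As a side note on the "other strategies" mentioned above: SimpleImputer also supports strategies such as 'median' and 'most_frequent', and add_indicator=True appends a binary missing-value flag for each column that had missing values, so the model can still tell which entries were imputed. Here is a small variation on the same toy data; the specific settings are just one possible choice:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer

X_train = [[0, 0, np.nan], [np.nan, 1, 1]]
Y_train = [0, 1]

# Median imputation, plus indicator columns for the features that had missing values
imp = SimpleImputer(missing_values=np.nan, strategy='median', add_indicator=True)
X_train_imp = imp.fit_transform(X_train)

clf = RandomForestClassifier(n_estimators=10).fit(X_train_imp, Y_train)
print(clf.predict(imp.transform([[np.nan, 1, np.nan]])))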

That concludes this look at handling nan/null values with classifiers in scikit-learn. Hopefully the answer above is helpful.
