I'm trying to do multi-class text classification with xgboost (the sklearn wrapper) in Python, but it sometimes fails telling me the feature names don't match. The strange thing is that sometimes it does work (maybe one run in four), but that unpredictability makes it hard for me to rely on this solution right now, even though it shows encouraging results without any real preprocessing.

I've included some illustrative sample data in the code below, similar to the data I'll actually be using. The code I currently have is as follows:

Updated the code to reflect maxymoo's suggestion

import xgboost as xgb
import numpy as np
from sklearn.cross_validation import KFold, train_test_split
from sklearn.metrics import accuracy_score
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline

rng = np.random.RandomState(31337)

# toy labels and documents standing in for the real data
y = np.array([0, 1, 2, 1, 0, 3, 1, 2, 3, 0])
X = np.array(['milk honey bear bear honey tigger',
          'tom jerry cartoon mouse cat cat WB',
          'peppa pig mommy daddy george peppa pig pig',
          'cartoon jerry tom silly',
          'bear honey hundred year woods',
          'ben holly elves fairies gaston fairy fairies castle king',
          'tom and jerry mouse WB',
          'peppa pig daddy pig rebecca rabit',
          'elves ben holly little kingdom king big people',
          'pot pot pot pot jar winnie pooh disney tigger bear'])

# vectoriser and classifier packaged together so they are fitted on the same folds
xgb_model = make_pipeline(CountVectorizer(), xgb.XGBClassifier())

kf = KFold(y.shape[0], n_folds=2, shuffle=True, random_state=rng)
for train_index, test_index in kf:
    xgb_model.fit(X[train_index],y[train_index])
    predictions = xgb_model.predict(X[test_index])
    actuals = y[test_index]
    accuracy = accuracy_score(actuals, predictions)
    print accuracy


The error I tend to get is the following:

Traceback (most recent call last):
  File "main.py", line 95, in <module>
    predictions = xgb_model.predict(X[test_index])
  File "//anaconda/lib/python2.7/site-packages/xgboost-0.6-py2.7.egg/xgboost/sklearn.py", line 465, in predict
    ntree_limit=ntree_limit)
  File "//anaconda/lib/python2.7/site-packages/xgboost-0.6-py2.7.egg/xgboost/core.py", line 939, in predict
    self._validate_features(data)
  File "//anaconda/lib/python2.7/site-packages/xgboost-0.6-py2.7.egg/xgboost/core.py", line 1179, in _validate_features
    data.feature_names))
ValueError: feature_names mismatch: ['f0', 'f1', 'f2', 'f3', 'f4', 'f5', 'f6', 'f7', 'f8', 'f9', 'f10', 'f11', 'f12', 'f13', 'f14', 'f15', 'f16', 'f17', 'f18', 'f19', 'f20', 'f21', 'f22', 'f23', 'f24', 'f25', 'f26'] ['f0', 'f1', 'f2', 'f3', 'f4', 'f5', 'f6', 'f7', 'f8', 'f9', 'f10', 'f11', 'f12', 'f13', 'f14', 'f15', 'f16', 'f17', 'f18', 'f19', 'f20', 'f21', 'f22', 'f23', 'f24']
expected f26, f25 in input data
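
Looking at the traceback, the two feature-name lists have different lengths (27 vs. 25), so the vectoriser apparently ends up with a different vocabulary on the data being scored than on the data it was fitted on. Purely as an illustration (hypothetical subsets, not my actual folds), fitting separate CountVectorizers on different slices of the toy documents shows how easily the feature counts diverge:

# purely illustrative: two vectorisers fitted on different subsets of the toy data
from sklearn.feature_extraction.text import CountVectorizer

subset_a = ['milk honey bear bear honey tigger',
            'tom jerry cartoon mouse cat cat WB']
subset_b = ['peppa pig mommy daddy george peppa pig pig']

vec_a = CountVectorizer().fit(subset_a)
vec_b = CountVectorizer().fit(subset_b)

# the vocabularies, and hence the feature counts xgboost sees, differ
print len(vec_a.get_feature_names()), len(vec_b.get_feature_names())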


Any pointers would be greatly appreciated!

Best Answer

You need to make sure you only score the model with the features it was trained on. The usual way to do this is to use a Pipeline to package the vectorizer and the model together; that way they are trained at the same time, and if a new feature is encountered in the test data the vectorizer will simply ignore it (note also that you don't need to re-create the model at each stage of the cross-validation, you can initialise it once and then refit it on each fold):

from sklearn.pipeline import make_pipeline

xgb_model = make_pipeline(CountVectorizer(), xgb.XGBClassifier())
for train_index, test_index in kf:
    xgb_model.fit(X[train_index],y[train_index])
    predictions = xgb_model.predict(X[test_index])
    actuals = y[test_index]
    accuracy = accuracy_score(actuals, predictions)
    print accuracy
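
As a sanity check, once the pipeline has been fitted you can confirm that the vectoriser keeps a fixed vocabulary, so scoring text that contains unseen words still produces exactly the number of columns the booster was trained on. A minimal sketch (the 'countvectorizer' step name is what make_pipeline assigns automatically, and the 'zebra' document is just a hypothetical example):

vectoriser = xgb_model.named_steps['countvectorizer']
n_trained_features = len(vectoriser.get_feature_names())

# transforming unseen text with the already-fitted vectoriser keeps the same width;
# a word like 'zebra' that never appeared in the training fold is simply ignored
unseen = vectoriser.transform(['zebra bear honey zebra'])
print n_trained_features, unseen.shape[1]  # the two numbers agree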
