Custom TransformerMixin with data labels in sklearn

Problem description

I'm working on a small project where I'm trying to apply SMOTE (Synthetic Minority Over-sampling Technique) because my data is imbalanced.

I created a custom TransformerMixin for the SMOTE step:

class smote(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        print(X.shape, ' ', type(X)) # (57, 28)   <class 'numpy.ndarray'>
        print(len(y), ' ', type(y))  #    57      <class 'list'>
        smote = SMOTE(kind='regular', n_jobs=-1)
        X, y = smote.fit_sample(X, y)

        return X

    def transform(self, X):
        return X

model = Pipeline([
    ('posFeat1', featureVECTOR()),
    ('sca1', StandardScaler()),
    ('smote', smote()),
    ('classification', SGDClassifier(loss='hinge', max_iter=1, random_state=38, tol=None))
])
model.fit(train_df, train_df['label'].values.tolist())
predicted = model.predict(test_df)

I implemented SMOTE in the fit() method because I don't want it to be applied to the test data.

Unfortunately, I got this error:

     model.fit(train_df, train_df['label'].values.tolist())
  File "C:\Python35\lib\site-packages\sklearn\pipeline.py", line 248, in fit
    Xt, fit_params = self._fit(X, y, **fit_params)
  File "C:\Python35\lib\site-packages\sklearn\pipeline.py", line 213, in _fit
    **fit_params_steps[name])
  File "C:\Python35\lib\site-packages\sklearn\externals\joblib\memory.py", line 362, in __call__
    return self.func(*args, **kwargs)
  File "C:\Python35\lib\site-packages\sklearn\pipeline.py", line 581, in _fit_transform_one
    res = transformer.fit_transform(X, y, **fit_params)
  File "C:\Python35\lib\site-packages\sklearn\base.py", line 520, in fit_transform
    return self.fit(X, y, **fit_params).transform(X)
AttributeError: 'numpy.ndarray' object has no attribute 'transform'

Recommended answer

The fit() method should return self, not the transformed values. If you need this behaviour only for the training data and not for the test data, implement the fit_transform() method.

class smote(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        print(X.shape, ' ', type(X)) # (57, 28)   <class 'numpy.ndarray'>
        print(len(y), ' ', type(y))  #    57      <class 'list'>
        self.smote = SMOTE(kind='regular', n_jobs=-1).fit(X, y)

        return self

    def fit_transform(self, X, y=None):
        self.fit(X, y)
        return self.smote.sample(X, y)

    def transform(self, X):
        return X

Explanation: On the training data (i.e. when pipeline.fit() is called), Pipeline first tries to call fit_transform() on each intermediate step. If a step does not have that method, Pipeline calls fit() and transform() separately.

On the test data, only transform() is called on each intermediate step, so the test data you supply is not changed there.
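The call order described above can be checked with a small sketch. The `CallRecorder` class and the toy data are hypothetical and not part of the original question; the step does nothing except log which of its methods the pipeline invokes.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Module-level log, so the record does not depend on which object instance is used.
calls = []


class CallRecorder(TransformerMixin, BaseEstimator):
    """Hypothetical pass-through step that logs which method the pipeline calls."""

    def fit(self, X, y=None):
        calls.append("fit")
        return self  # fit() must return the estimator itself

    def fit_transform(self, X, y=None):
        calls.append("fit_transform")
        return X

    def transform(self, X):
        calls.append("transform")
        return X


# Toy data (assumption): four samples, one feature, two classes.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = [0, 0, 1, 1]

pipe = Pipeline([("recorder", CallRecorder()), ("clf", LogisticRegression())])
pipe.fit(X, y)    # training: the step's fit_transform() is used
pipe.predict(X)   # prediction: only the step's transform() is used

print(calls)
```

Because the step defines its own fit_transform(), anything done only in that method affects the training path and never the prediction path.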

Update: The above code will still throw an error. When you oversample the supplied data, the number of samples in both X and y changes, but the pipeline only passes the transformed X along; y is left unchanged. So once the error above is corrected, you will either get an error about the number of samples not matching the number of labels or, if the number of resampled rows happens to equal the original number, the y values will no longer correspond to the new samples.
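A minimal sketch of that mismatch, assuming a hypothetical `RowDuplicator` step that stands in for an oversampler by returning twice as many rows as it received (it is not SMOTE, only an illustration of the row-count problem):

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline


class RowDuplicator(TransformerMixin, BaseEstimator):
    """Hypothetical stand-in for an oversampler: returns more rows than it received."""

    def fit(self, X, y=None):
        return self

    def fit_transform(self, X, y=None):
        # Only X can be returned here; the pipeline keeps using the original y.
        return np.vstack([X, X])

    def transform(self, X):
        return X


# Toy data (assumption): four samples, one feature, two classes.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = [0, 0, 1, 1]

pipe = Pipeline([("dup", RowDuplicator()), ("clf", LogisticRegression())])

raised = False
try:
    pipe.fit(X, y)  # the classifier receives 8 rows of X but only 4 labels
except ValueError as exc:
    raised = True
    print("fit failed:", exc)
```

A scikit-learn transformer has no way to hand a new y back to the pipeline, which is why resampling needs a pipeline that understands samplers.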

Working solution: silly me.

You can just use the Pipeline from the imblearn package in place of the scikit-learn Pipeline. It takes care of resampling automatically when fit() is called on the pipeline, and it does not resample the test data (when transform() or predict() is called).

Actually, I knew that imblearn's Pipeline handles the sample() method, but I was thrown off when you implemented a custom class and said that the test data must not change. It did not occur to me that this is the default behaviour.

Just replace

from sklearn.pipeline import Pipeline

with

from imblearn.pipeline import Pipeline

and you are all set. There is no need to write a custom class as you did; just use the original SMOTE. Something like:

random_state = 38
model = Pipeline([
        ('posFeat1', featureVECTOR()),
        ('sca1', StandardScaler()),

        # Original SMOTE class
        ('smote', SMOTE(random_state=random_state)),
        ('classification', SGDClassifier(loss='hinge', max_iter=1, random_state=random_state, tol=None))
    ])
