How to standardize only numeric variables in an sklearn pipeline?

Question

I am trying to create an sklearn pipeline with 2 steps:

1. Standardize the data
2. Fit the data using KNN

However, my data has both numeric and categorical variables, which I have converted to dummies using pd.get_dummies. I want to standardize the numeric variables but leave the dummies as they are. I have been doing this like this:

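import pandas as pd
from sklearn.preprocessing import StandardScaler

# X is a DataFrame containing both numeric and categorical (dummy) columns
# numeric is the list of numeric column names
# categorical is the list of categorical column names
scaler = StandardScaler()
# Keep X's index so the index-based merge below lines the rows up correctly
X_numeric_std = pd.DataFrame(data=scaler.fit_transform(X[numeric]), columns=numeric, index=X.index)
X_std = pd.merge(X_numeric_std, X[categorical], left_index=True, right_index=True)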

However, if I were to create a pipeline like:

pipe = sklearn.pipeline.make_pipeline(StandardScaler(), KNeighborsClassifier())

It would standardize all of the columns in my DataFrame. Is there a way to do this while standardizing only the numeric columns?

Recommended answer

UPD: 2021-05-10

For sklearn >= 0.20 we can use sklearn.compose.ColumnTransformer

Here is a small example:

Imports and data loading:

# Author: Pedro Morales <part.morales@gmail.com>
#
# License: BSD 3 clause

import numpy as np

from sklearn.compose import ColumnTransformer
from sklearn.datasets import fetch_openml
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, GridSearchCV

np.random.seed(0)

# Load data from https://www.openml.org/d/40945
X, y = fetch_openml("titanic", version=1, as_frame=True, return_X_y=True)

Pipeline-aware data preprocessing using ColumnTransformer:

numeric_features = ['age', 'fare']
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

categorical_features = ['embarked', 'sex', 'pclass']
categorical_transformer = OneHotEncoder(handle_unknown='ignore')

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])
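
If the column names are not known up front, the columns can also be selected by dtype instead of being listed by hand. The following is a minimal sketch, assuming scikit-learn >= 0.22 (where sklearn.compose.make_column_selector is available). The DataFrame toy and the other variable names are made up for illustration and are not part of the original answer:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: two numeric columns and one string column.
toy = pd.DataFrame({
    "age": [22.0, 38.0, 26.0, 35.0],
    "fare": [7.25, 71.28, 7.92, 53.10],
    "sex": ["male", "female", "female", "male"],
})

# Numeric columns are scaled; every non-numeric column is one-hot encoded.
dtype_preprocessor = ColumnTransformer(transformers=[
    ("num", StandardScaler(), make_column_selector(dtype_include=np.number)),
    ("cat", OneHotEncoder(handle_unknown="ignore"),
     make_column_selector(dtype_exclude=np.number)),
])

toy_transformed = dtype_preprocessor.fit_transform(toy)
print(toy_transformed.shape)  # 2 scaled columns + 2 one-hot columns
```

Selecting by dtype only works if the dtypes are trustworthy; a numeric code stored as an integer (such as pclass in the Titanic data) would be scaled rather than encoded unless it is listed explicitly.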

Classification:

# Append classifier to preprocessing pipeline.
# Now we have a full prediction pipeline.
clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', LogisticRegression())])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=0)

clf.fit(X_train, y_train)
print("model score: %.3f" % clf.score(X_test, y_test))
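
For the setup described in the question, where the dummies were already created with pd.get_dummies and the estimator is KNN, the same idea can be sketched with remainder='passthrough', so that the dummy columns bypass the scaler. The data and column names below are hypothetical stand-ins for the asker's DataFrame:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical data standing in for the asker's DataFrame.
raw = pd.DataFrame({
    "height": [150.0, 160.0, 170.0, 180.0, 190.0, 200.0],
    "weight": [50.0, 60.0, 65.0, 80.0, 90.0, 100.0],
    "color": ["red", "blue", "red", "blue", "red", "blue"],
})
labels = [0, 0, 0, 1, 1, 1]

# Dummies are created up front, as in the question.
X_dummies = pd.get_dummies(raw, columns=["color"])
numeric = ["height", "weight"]

# Scale only the numeric columns; all other columns pass through unchanged.
scale_numeric_only = ColumnTransformer(
    transformers=[("num", StandardScaler(), numeric)],
    remainder="passthrough",
)
knn_pipe = make_pipeline(scale_numeric_only, KNeighborsClassifier(n_neighbors=3))
knn_pipe.fit(X_dummies, labels)

# Inspect what the classifier actually sees: scaled numerics first, then dummies.
X_out = knn_pipe.named_steps["columntransformer"].transform(X_dummies)
print(np.round(X_out.astype(float), 3))
```

Note that the transformed columns are reordered: the columns handled by named transformers come first, and the passthrough columns follow.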

Old answer:

Assuming you have the following DF:

In [163]: df
Out[163]:
     a     b    c    d
0  aaa  1.01  xxx  111
1  bbb  2.02  yyy  222
2  ccc  3.03  zzz  333

In [164]: df.dtypes
Out[164]:
a     object
b    float64
c     object
d      int64
dtype: object

You can find all numeric columns:

In [165]: num_cols = df.columns[df.dtypes.apply(lambda c: np.issubdtype(c, np.number))]

In [166]: num_cols
Out[166]: Index(['b', 'd'], dtype='object')

In [167]: df[num_cols]
Out[167]:
      b    d
0  1.01  111
1  2.02  222
2  3.03  333
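
A shorter way to get the same columns is pandas' DataFrame.select_dtypes. The sketch below rebuilds the example frame so that it runs on its own:

```python
import pandas as pd

# The example frame from above, rebuilt so the snippet is self-contained.
df = pd.DataFrame({
    "a": ["aaa", "bbb", "ccc"],
    "b": [1.01, 2.02, 3.03],
    "c": ["xxx", "yyy", "zzz"],
    "d": [111, 222, 333],
})

# "number" matches both integer and float dtypes.
num_cols = df.select_dtypes(include="number").columns
print(list(num_cols))  # ['b', 'd']
```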

And apply StandardScaler only to those numeric columns:

In [168]: scaler = StandardScaler()

In [169]: df[num_cols] = scaler.fit_transform(df[num_cols])

In [170]: df
Out[170]:
     a         b    c         d
0  aaa -1.224745  xxx -1.224745
1  bbb  0.000000  yyy  0.000000
2  ccc  1.224745  zzz  1.224745
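
The value 1.224745 follows from how StandardScaler computes the z-score: it subtracts the column mean and divides by the population standard deviation (ddof=0), which for three equally spaced values gives sqrt(1.5). A quick check with NumPy on column b, written as a sketch rather than a copy of the scaler's internals:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

b = np.array([1.01, 2.02, 3.03])

# z = (x - mean) / std, using the population standard deviation (ddof=0).
manual = (b - b.mean()) / b.std(ddof=0)

# The same column run through StandardScaler for comparison.
scaled = StandardScaler().fit_transform(b.reshape(-1, 1)).ravel()

print(np.round(manual, 6))
print(np.allclose(manual, scaled))  # True
```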

Now you can "one hot encode" the categorical (non-numeric) columns...
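
One way to do this last step is pd.get_dummies, which the question already uses. A minimal sketch on the example frame, rebuilt here so the snippet is self-contained; the name encoded is chosen for illustration:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "a": ["aaa", "bbb", "ccc"],
    "b": [1.01, 2.02, 3.03],
    "c": ["xxx", "yyy", "zzz"],
    "d": [111, 222, 333],
})

# Scale the numeric columns first, as in the steps above.
num_cols = df.select_dtypes(include="number").columns
scaled_values = StandardScaler().fit_transform(df[num_cols])
df = df.drop(columns=num_cols).join(
    pd.DataFrame(scaled_values, columns=num_cols, index=df.index)
)

# One-hot encode the remaining (non-numeric) columns.
encoded = pd.get_dummies(df, columns=["a", "c"])
print(encoded.columns.tolist())
```

Unlike a ColumnTransformer inside a pipeline, this approach fits the scaler on whatever frame it is given, so the train/test split has to be handled by hand to avoid leaking test statistics into the scaler.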
