Problem description
I am trying to create an sklearn pipeline with 2 steps:
- Standardize the data
- Fit the data using KNN
However, my data has both numeric and categorical variables, which I have converted to dummies using pd.get_dummies. I want to standardize the numeric variables but leave the dummies as they are. I have been doing it like this:
import pandas as pd
from sklearn.preprocessing import StandardScaler
# X: DataFrame containing both numeric and categorical (dummy) columns
# numeric: list of numeric column names
# categorical: list of categorical (dummy) column names
scaler = StandardScaler()
# Reuse X's index so the index-based merge below lines up row by row
X_numeric_std = pd.DataFrame(data=scaler.fit_transform(X[numeric]), columns=numeric, index=X.index)
X_std = pd.merge(X_numeric_std, X[categorical], left_index=True, right_index=True)
However, if I were to create a pipeline like:
pipe = sklearn.pipeline.make_pipeline(StandardScaler(), KNeighborsClassifier())
it would standardize all of the columns in my DataFrame. Is there a way to do this while standardizing only the numeric columns?
Recommended answer
UPD: 2021-05-10
For sklearn >= 0.20 we can use sklearn.compose.ColumnTransformer.
Here is a small example:
Imports and data loading:
# Author: Pedro Morales <part.morales@gmail.com>
#
# License: BSD 3 clause
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.datasets import fetch_openml
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, GridSearchCV
np.random.seed(0)
# Load data from https://www.openml.org/d/40945
X, y = fetch_openml("titanic", version=1, as_frame=True, return_X_y=True)
Pipeline-aware data preprocessing using ColumnTransformer:
numeric_features = ['age', 'fare']
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])
categorical_features = ['embarked', 'sex', 'pclass']
categorical_transformer = OneHotEncoder(handle_unknown='ignore')
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])
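If listing column names by hand is inconvenient, the columns can probably be picked by dtype instead. The following is a minimal sketch, assuming sklearn >= 0.22 (where sklearn.compose.make_column_selector is available); the toy frame and the name auto_preprocessor are hypothetical and not part of the original answer.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy frame standing in for the real data.
toy = pd.DataFrame({
    "age": [22.0, 38.0, 26.0, 35.0],
    "fare": [7.25, 71.28, 7.92, 53.10],
    "sex": ["male", "female", "female", "male"],
})

# Select numeric columns by dtype for scaling and everything else for
# one-hot encoding; sparse_threshold=0 forces a dense NumPy result.
auto_preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(),
         make_column_selector(dtype_include=np.number)),
        ("cat", OneHotEncoder(handle_unknown="ignore"),
         make_column_selector(dtype_exclude=np.number)),
    ],
    sparse_threshold=0,
)

transformed = auto_preprocessor.fit_transform(toy)
print(transformed.shape)  # 2 scaled numeric columns + 2 one-hot columns
```

The output columns follow the order of the transformers list, so the scaled numeric columns come first and the one-hot columns after them.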
Classification:
# Append classifier to preprocessing pipeline.
# Now we have a full prediction pipeline.
clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', LogisticRegression())])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=0)
clf.fit(X_train, y_train)
print("model score: %.3f" % clf.score(X_test, y_test))
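For the setup in the question (dummies already created with pd.get_dummies, KNN as the final step), one possible adaptation is to scale only the numeric columns and pass the remaining columns through untouched with remainder="passthrough". This is a sketch under assumptions: the toy data, the column names, and the variable names below are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical data: two numeric columns plus one categorical column.
raw = pd.DataFrame({
    "height": [150.0, 160.0, 170.0, 180.0, 190.0, 200.0],
    "weight": [50.0, 60.0, 65.0, 80.0, 90.0, 100.0],
    "color": ["red", "blue", "red", "blue", "red", "blue"],
})
y_toy = [0, 0, 0, 1, 1, 1]

# Dummies are created up front, as in the question.
X_toy = pd.get_dummies(raw, columns=["color"], dtype=float)
numeric = ["height", "weight"]

# Scale the numeric columns; every other column is passed through as-is.
scale_numeric_only = ColumnTransformer(
    transformers=[("num", StandardScaler(), numeric)],
    remainder="passthrough",
)

knn_pipe = make_pipeline(scale_numeric_only,
                         KNeighborsClassifier(n_neighbors=3))
knn_pipe.fit(X_toy, y_toy)

# Inspect what the classifier actually sees: scaled numerics, raw dummies.
scaled = knn_pipe.named_steps["columntransformer"].transform(X_toy)
print(np.round(scaled, 3))
```

Because the scaler lives inside the pipeline, it is refit on the training folds only when the pipeline is used with cross-validation or GridSearchCV.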
Old answer:
Suppose you have the following DataFrame:
In [163]: df
Out[163]:
     a     b    c    d
0  aaa  1.01  xxx  111
1  bbb  2.02  yyy  222
2  ccc  3.03  zzz  333
In [164]: df.dtypes
Out[164]:
a     object
b    float64
c     object
d      int64
dtype: object
You can find all numeric columns:
In [165]: num_cols = df.columns[df.dtypes.apply(lambda c: np.issubdtype(c, np.number))]
In [166]: num_cols
Out[166]: Index(['b', 'd'], dtype='object')
In [167]: df[num_cols]
Out[167]:
      b    d
0  1.01  111
1  2.02  222
2  3.03  333
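As an alternative to the dtype lambda, pandas offers DataFrame.select_dtypes, which should give the same column list here. A minimal sketch that rebuilds the example frame so it runs on its own:

```python
import pandas as pd

# Rebuild the example frame from above so the snippet is self-contained.
df = pd.DataFrame({
    "a": ["aaa", "bbb", "ccc"],
    "b": [1.01, 2.02, 3.03],
    "c": ["xxx", "yyy", "zzz"],
    "d": [111, 222, 333],
})

# Keep only columns whose dtype is numeric (ints and floats).
num_cols = df.select_dtypes(include="number").columns
print(list(num_cols))  # ['b', 'd']
```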
Then apply StandardScaler only to those numeric columns:
In [168]: scaler = StandardScaler()
In [169]: df[num_cols] = scaler.fit_transform(df[num_cols])
In [170]: df
Out[170]:
     a         b    c         d
0  aaa -1.224745  xxx -1.224745
1  bbb  0.000000  yyy  0.000000
2  ccc  1.224745  zzz  1.224745
Now you can "one hot encode" the categorical (non-numeric) columns...
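The answer stops before showing the encoding step. One way it could look is sketched below, using pd.get_dummies on the non-numeric columns and concatenating the result with the scaled numeric columns. The names cat_cols, scaled, encoded, and result are illustrative assumptions, and the frame is rebuilt so the snippet runs on its own.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Rebuild the example frame so the snippet is self-contained.
df = pd.DataFrame({
    "a": ["aaa", "bbb", "ccc"],
    "b": [1.01, 2.02, 3.03],
    "c": ["xxx", "yyy", "zzz"],
    "d": [111, 222, 333],
})

num_cols = df.select_dtypes(include="number").columns
cat_cols = df.columns.difference(num_cols)  # everything non-numeric

# Scale the numeric columns into a new frame that keeps df's index.
scaled = pd.DataFrame(
    StandardScaler().fit_transform(df[num_cols]),
    columns=num_cols,
    index=df.index,
)

# One-hot encode the non-numeric columns; they are not scaled.
encoded = pd.get_dummies(df[cat_cols], dtype=float)

# Combine side by side: scaled numerics first, then the dummy columns.
result = pd.concat([scaled, encoded], axis=1)
print(result.columns.tolist())
```

Building new frames instead of assigning back into df leaves the original data untouched, which makes it easier to compare before and after.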