One-hot encoding in scikit-learn for only part of a DataFrame


Question

I am trying to use a decision tree classifier on my data, which looks very similar to the data in this tutorial: https://www.ritchieng.com/machinelearning-one-hot-encoding/

The tutorial then goes on to convert the strings into numeric data:

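import pandas as pd
from sklearn import preprocessing
X = pd.read_csv('titanic_data.csv')
# Keep only the string (object) columns; the numeric columns are dropped here
X = X.select_dtypes(include=[object])
le = preprocessing.LabelEncoder()
# Replace the values in each string column with integer labels
X_2 = X.apply(le.fit_transform)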

This leaves the DataFrame containing only the former string columns, each replaced by integer labels.

After this, the data is put through the OneHotEncoder, and I assume it can then be split and passed into a decision tree classifier fairly easily.

The problem is that, as far as I can tell, the original numeric data gets lost during this encoding process. How can I keep the numeric data, or add back later what was removed during encoding? Thanks!
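The numeric columns go missing at the select_dtypes step rather than inside the encoder. The sketch below illustrates this on a small made-up frame (the tutorial's titanic_data.csv is not available here, so the column names and values are assumptions) and shows one possible way to join the numeric part back afterwards.

```python
import pandas as pd
from sklearn import preprocessing

# Made-up stand-in for titanic_data.csv (assumption). dtype=object is set
# explicitly so the sketch does not depend on how pandas infers string columns.
df = pd.DataFrame({
    "age": [24, 25, 68, 39],
    "sex": pd.Series(["female", "male", "male", "female"], dtype=object),
    "embarked": pd.Series(["S", "C", "S", "Q"], dtype=object),
})

# Only the object columns survive this step, so "age" is dropped here.
strings_only = df.select_dtypes(include=[object])

# Label-encode each string column with its own encoder.
encoded = pd.DataFrame(
    {
        col: preprocessing.LabelEncoder().fit_transform(strings_only[col])
        for col in strings_only.columns
    },
    index=strings_only.index,
)

# The numeric part still exists in the original frame and can be
# concatenated back, aligned on the index.
numeric_part = df.select_dtypes(exclude=[object])
combined = pd.concat([numeric_part, encoded], axis=1)
```

Here combined holds age next to the encoded sex and embarked columns, so nothing from the original frame is lost.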

Recommended answer

Actually, there is a really simple solution: use pd.get_dummies().

If you have a DataFrame like the following:

import pandas as pd

so_data = {
    'passenger_id': [1, 2, 3, 4, 5],
    'survived': [1, 0, 0, 1, 0],
    'age': [24, 25, 68, 39, 5],
    'sex': ['female', 'male', 'male', 'female', 'female'],
    'first_name': ['Joanne', 'Mark', 'Josh', 'Petka', 'Ariel']
}
so_df = pd.DataFrame(so_data)

which looks like this:

   passenger_id  survived  age     sex first_name
0             1         1   24  female     Joanne
1             2         0   25    male       Mark
2             3         0   68    male       Josh
3             4         1   39  female      Petka
4             5         0    5  female      Ariel

You can do the following:

pd.get_dummies(so_df)

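This gives you a DataFrame in which the numeric columns (passenger_id, survived, age) are left untouched, while each string column is expanded into one indicator column per distinct value, such as sex_female and sex_male, plus one first_name_... column for every name.

(The original answer showed the result as a screenshot, because it is really difficult to format a DataFrame cleanly on SO.)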
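As a rough sketch of how this could feed the decision tree from the question, the block below encodes only the sex column and leaves age as it is. Dropping passenger_id and first_name, using survived as the target, and the names features, target and clf are all assumptions made for illustration; they are not part of the original answer.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

so_df = pd.DataFrame({
    "passenger_id": [1, 2, 3, 4, 5],
    "survived": [1, 0, 0, 1, 0],
    "age": [24, 25, 68, 39, 5],
    "sex": ["female", "male", "male", "female", "female"],
    "first_name": ["Joanne", "Mark", "Josh", "Petka", "Ariel"],
})

# Assumption: identifier-like columns are dropped, because one-hot encoding
# first_name would create a separate column for every passenger.
inputs = so_df.drop(columns=["passenger_id", "first_name", "survived"])

# Encode only "sex"; "age" passes through unchanged.
# dtype=int yields 0/1 columns instead of booleans.
features = pd.get_dummies(inputs, columns=["sex"], dtype=int)
target = so_df["survived"]

# Fit a decision tree on the mixed numeric and one-hot features.
clf = DecisionTreeClassifier(random_state=0)
clf.fit(features, target)
```

Passing columns=["sex"] limits the encoding to the listed columns, which is one way to one-hot encode only part of a DataFrame while keeping the numeric data in place.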
