Problem description
I am trying to use a decision tree classifier on my data, which looks very similar to the data in this tutorial: https://www.ritchieng.com/machinelearning-one-hot-encoding/
The tutorial then goes on to convert the strings into numeric data:
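import pandas as pd
from sklearn import preprocessing
X = pd.read_csv('titanic_data.csv')
# Keep only the string (object-dtype) columns; numeric columns are dropped here
X = X.select_dtypes(include=[object])
le = preprocessing.LabelEncoder()
# Replace each string column with integer category codes
X_2 = X.apply(le.fit_transform)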
This leaves the DataFrame looking like this:
After this, the data is put through the OneHotEncoder, and I assume it can then be split and passed into a decision tree classifier fairly easily.
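For reference, here is a minimal sketch of the flow described above (label-encode the string columns, then one-hot encode them). The small frame and its column names (`age`, `sex`, `embarked`) are hypothetical stand-ins for the tutorial's CSV, and the non-numeric columns are picked with `pd.api.types.is_numeric_dtype` rather than `select_dtypes`, purely as an illustration choice.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Hypothetical stand-in for the tutorial's data: one numeric, two string columns.
df = pd.DataFrame({
    'age': [24, 25, 68, 39, 5],
    'sex': ['female', 'male', 'male', 'female', 'female'],
    'embarked': ['S', 'C', 'S', 'Q', 'S'],
})

# Keep only the non-numeric columns, mirroring the tutorial's selection step.
string_cols = [c for c in df.columns if not pd.api.types.is_numeric_dtype(df[c])]
X = df[string_cols]

# Label-encode each string column to integer codes (a fresh encoder per column).
X_2 = X.apply(lambda col: LabelEncoder().fit_transform(col))

# One-hot encode the integer codes; the encoder returns a sparse matrix by default,
# so convert it to a dense array for inspection.
onehot = OneHotEncoder().fit_transform(X_2).toarray()

print(string_cols)   # the columns that made it into the encoder
print(onehot.shape)  # rows x (number of distinct categories across those columns)
```

Note that `age` never reaches the encoder in this sketch, because the column selection step already filtered it out.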
The problem is that, as far as I can tell, the original numeric data gets lost in this encoding process. How can I keep the numeric data that was removed during encoding, or add it back in later? Thanks!
Recommended answer
Actually there is a really simple solution - using pd.get_dummies()
If you have a DataFrame like the following:
import pandas as pd
so_data = {
    'passenger_id': [1, 2, 3, 4, 5],
    'survived': [1, 0, 0, 1, 0],
    'age': [24, 25, 68, 39, 5],
    'sex': ['female', 'male', 'male', 'female', 'female'],
    'first_name': ['Joanne', 'Mark', 'Josh', 'Petka', 'Ariel']
}
so_df = pd.DataFrame(so_data)
Which looks like this:
   passenger_id  survived  age     sex first_name
0             1         1   24  female     Joanne
1             2         0   25    male       Mark
2             3         0   68    male       Josh
3             4         1   39  female      Petka
4             5         0    5  female      Ariel
You can just do:
pd.get_dummies(so_df)
Which will give you:
(sorry for the screenshot, but it's really difficult to clean the df on SO)
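As a text stand-in for the result, here is a minimal sketch that rebuilds the same example frame and inspects what `pd.get_dummies` returns. The variable names `encoded` and `encoded_sex_only` are illustrative, and the second call (restricting the encoding to `sex` via the `columns` parameter) is an addition not in the original answer.

```python
import pandas as pd

so_df = pd.DataFrame({
    'passenger_id': [1, 2, 3, 4, 5],
    'survived': [1, 0, 0, 1, 0],
    'age': [24, 25, 68, 39, 5],
    'sex': ['female', 'male', 'male', 'female', 'female'],
    'first_name': ['Joanne', 'Mark', 'Josh', 'Petka', 'Ariel'],
})

# By default, every string column is expanded into indicator columns,
# while the numeric columns are passed through unchanged.
encoded = pd.get_dummies(so_df)
print(list(encoded.columns))

# Optionally restrict the encoding to chosen columns; other columns,
# including the string column first_name, are left as they are.
encoded_sex_only = pd.get_dummies(so_df, columns=['sex'])
print(list(encoded_sex_only.columns))
```

One thing to watch: a nearly unique string column such as `first_name` produces one indicator column per distinct value, so it may be worth dropping it or limiting the encoding with `columns=` before fitting a classifier.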