Problem description
I'm trying to keep rows in a dataset that contain missing data.
When one-hot encoding a column (or multiple columns) with sklearn, is it possible to write a rule so that if currentItem == null or currentItem == 0, the output array is set to all 0s?
For example:
A A B -> [[1, 0], [1, 0], [0, 1]]
B B A -> [[0, 1], [0, 1], [1, 0]]
null B A -> [[0, 0], [0, 1], [1, 0]]
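One-hot encoding:
import numpy as np
from sklearn.preprocessing import LabelEncoder
from keras.utils import to_categorical  # import missing from the snippet; assuming the Keras helper is meant

# dtype=str so that labels such as "A" or "null" load without a float conversion error
dataset = np.loadtxt("someFile.csv", delimiter=",", dtype=str)
B = dataset[:, 1]  # second column
encoder = LabelEncoder()
encoder.fit(B)
encoded_B = encoder.transform(B)  # one integer label per category
Y = to_categorical(encoded_B)  # one-hot row per label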
EDIT - Example dataset, where A-E are inputs and X and Y are outputs:
A B C D E X Y
7 6 3 3 2 11 4
5 6 0 0 7 15 7
3 3 9 null 7 12 7
7 null 7 null 7 12 13
null 7 4 6 12 13 4
null 5 7 6 null 14 7
2 6 0 0 2 13 3
7 null 7 null 2 13 7
Recommended answer
If you have pandas, this is pretty simple.
import numpy as np
import pandas as pd

s = pd.Series(['A', 'A', 0, 'B', 0, 'A', np.nan])
s
0      A
1      A
2      0
3      B
4      0
5      A
6    NaN
dtype: object
Use replace to convert 0 (as either an integer or a string) to NaN:
s = s.replace({0 : np.nan, '0' : np.nan})
s
0      A
1      A
2    NaN
3      B
4    NaN
5      A
6    NaN
dtype: object
Now, call pd.get_dummies, which ignores NaN values by default, so each missing entry becomes an all-zero row.
pd.get_dummies(s, dtype=int)  # dtype=int gives 0/1; pandas 2.0+ returns booleans by default
   A  B
0  1  0
1  1  0
2  0  0
3  0  1
4  0  0
5  1  0
6  0  0
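To get the nested-array form used in the question, the dummy frame can be converted to a NumPy array. The following is a minimal sketch, assuming the third example column from the question (null, B, A) and that null has already been read as NaN; the variable name encoded is only illustrative.

```python
import numpy as np
import pandas as pd

# Third example column from the question: null, B, A (null represented as NaN)
s = pd.Series([np.nan, 'B', 'A'])

# dummy_na defaults to False, so the NaN row gets no indicator column
# and ends up as all zeros; dtype=int forces 0/1 instead of booleans.
encoded = pd.get_dummies(s, dtype=int).to_numpy()

print(encoded.tolist())
```

This should reproduce the [[0, 0], [0, 1], [1, 0]] result the question asks for, with columns ordered A then B.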
The solution is the same for a DataFrame.
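As a rough illustration of the DataFrame case, the sketch below uses columns C and D from the first four rows of the example dataset. It assumes the values were read as strings (so missing entries appear as the literal "null") and that only these two columns should be encoded; both choices are assumptions made for the example, not something the question specifies.

```python
import numpy as np
import pandas as pd

# Columns C and D, first four rows of the example dataset, read as strings
df = pd.DataFrame({
    "C": ["3", "0", "9", "7"],
    "D": ["3", "0", "null", "null"],
})

# Treat both "0" and "null" as missing
cleaned = df.replace({"0": np.nan, "null": np.nan})

# One indicator column per remaining category; rows that are missing
# in a given column get all zeros for that column's indicators.
dummies = pd.get_dummies(cleaned, columns=["C", "D"], dtype=int)

print(dummies)
```

The second row, where both C and D were 0, comes out as all zeros, and the indicator columns are named after the source column and the category (C_3, C_7, C_9, D_3).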