Sklearn label encoding multiple columns of a pandas DataFrame

Question

I am trying to encode a number of columns containing categorical data ("Yes" and "No") in a large pandas DataFrame. The complete DataFrame has over 400 columns, so I am looking for a way to encode all the desired columns without having to encode them one by one. I use scikit-learn's LabelEncoder to encode the categorical data.

The first part of the DataFrame does not have to be encoded. However, I am looking for a method that encodes all the desired columns containing categorical data directly, without splitting and concatenating the DataFrame.

To demonstrate my question, I first tried to solve it on a small part of the DataFrame. However, I get stuck at the last step, where the data is fitted and transformed, and get a ValueError: bad input shape (4, 3). The code as I ran it:

# Create a simple dataframe resembling the large dataframe
import pandas as pd

data = pd.DataFrame({'A': [1, 2, 3, 4],
                     'B': ["Yes", "No", "Yes", "Yes"],
                     'C': ["Yes", "No", "No", "Yes"],
                     'D': ["No", "Yes", "No", "Yes"]})


# Import required module
from sklearn.preprocessing import LabelEncoder

# Create an object of the label encoder class
labelencoder = LabelEncoder()

# Apply labelencoder object on columns
labelencoder.fit_transform(data.ix[:, 1:])   # First column does not need to be encoded

The full error report:

labelencoder.fit_transform(data.ix[:, 1:])
Traceback (most recent call last):

  File "<ipython-input-47-b4986a719976>", line 1, in <module>
    labelencoder.fit_transform(data.ix[:, 1:])

  File "C:\Anaconda\Anaconda3\lib\site-packages\sklearn\preprocessing\label.py", line 129, in fit_transform
    y = column_or_1d(y, warn=True)

  File "C:\Anaconda\Anaconda3\lib\site-packages\sklearn\utils\validation.py", line 562, in column_or_1d
    raise ValueError("bad input shape {0}".format(shape))

ValueError: bad input shape (4, 3)

Does anyone know how to do this?

Recommended answer

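As in the following code, you can encode multiple columns by passing LabelEncoder's fit_transform to DataFrame.apply, which calls it once per column, so each call receives the 1-D input that LabelEncoder expects. Note, however, that the single encoder is refit on every column, so afterwards it holds the classes of the last column only, not of all columns. Note also that apply runs over every column, so column A is encoded as well.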

import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({'A': [1, 2, 3, 4],
                   'B': ["Yes", "No", "Yes", "Yes"],
                   'C': ["Yes", "No", "No", "Yes"],
                   'D': ["No", "Yes", "No", "Yes"]})
print(df)
#    A    B    C    D
# 0  1  Yes  Yes   No
# 1  2   No   No  Yes
# 2  3  Yes   No   No
# 3  4  Yes  Yes  Yes

# LabelEncoder
le = LabelEncoder()

# apply "le.fit_transform"
df_encoded = df.apply(le.fit_transform)
print(df_encoded)
#    A  B  C  D
# 0  0  1  1  0
# 1  1  0  0  1
# 2  2  1  0  0
# 3  3  1  1  1

# Note: le was refit on each column in turn, so classes_ only reflects the last column (D).
print(le.classes_)
# ['No' 'Yes']
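
If the classes of every column are needed, or if the first column should stay untouched as the question asks, one possible variation is to keep a separate LabelEncoder per column. The following is only a minimal sketch under stated assumptions: the names cols_to_encode and encoders are hypothetical, and "every column except the first" is an assumed selection rule that would need adapting to the real 400-column DataFrame.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({'A': [1, 2, 3, 4],
                   'B': ["Yes", "No", "Yes", "Yes"],
                   'C': ["Yes", "No", "No", "Yes"],
                   'D': ["No", "Yes", "No", "Yes"]})

# Assumed selection rule: encode every column except the first.
cols_to_encode = df.columns[1:]

# One fitted encoder per column, so each column's classes stay available.
encoders = {}
df_encoded = df.copy()
for col in cols_to_encode:
    encoders[col] = LabelEncoder()
    df_encoded[col] = encoders[col].fit_transform(df[col])

print(df_encoded)
#    A  B  C  D
# 0  1  1  1  0
# 1  2  0  0  1
# 2  3  1  0  0
# 3  4  1  1  1

# Classes are now available for any encoded column.
print(encoders['B'].classes_)
# ['No' 'Yes']

# Decode a column back to its original labels.
restored_b = encoders['B'].inverse_transform(df_encoded['B'])
```

Because each encoder is kept in the dictionary, the same mapping can be reused later, for example by calling encoders[col].transform on new data with the same labels.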
