This article explains the differences between pd.factorize, pd.get_dummies, sklearn.preprocessing.LabelEncoder and OneHotEncoder. It should be a useful reference for anyone facing the same question, so read on to learn more.

Problem Description



    All four functions seem very similar to me. In some situations some of them might give the same result, and in others not. Any help would be greatly appreciated!

    Now I know, and I assume, that internally factorize and LabelEncoder work the same way and have no big differences in terms of results. I am not sure whether they take a similar amount of time on large amounts of data.

    get_dummies and OneHotEncoder will yield the same result, but OneHotEncoder can only handle numbers, whereas get_dummies accepts all kinds of input. get_dummies also generates new column names automatically for each input column, while OneHotEncoder does not (it assigns generic column names 1, 2, 3, ...). So get_dummies seems better in all respects.

    Please correct me if I am wrong! Thank you!

    Solution

    These four encoders can be split in two categories:

    • Encode labels into categorical variables: Pandas factorize and scikit-learn LabelEncoder. The result will have 1 dimension.
    • Encode a categorical variable into dummy/indicator (binary) variables: Pandas get_dummies and scikit-learn OneHotEncoder. The result will have n dimensions, one per distinct value of the encoded categorical variable.

    The main difference between the pandas and scikit-learn encoders is that the scikit-learn encoders are designed to be used in scikit-learn pipelines, with fit and transform methods.
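A minimal sketch of what that means in practice (the toy values below are invented for illustration): a scikit-learn encoder learns its mapping once with fit and can then replay the same mapping on unseen data with transform, which is exactly the train/test workflow pipelines rely on.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Learn the label mapping from the training data only.
train = pd.Series(['A', 'B', 'C'])
new = pd.Series(['C', 'A'])

le = LabelEncoder()
le.fit(train)

# Reuse the learned mapping on new data: 'C' -> 2, 'A' -> 0.
print(le.transform(new))    # [2 0]
```

pd.factorize, by contrast, offers no built-in way to replay the same mapping on a second dataset.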

    Encode labels into categorical variables

    Pandas factorize and scikit-learn LabelEncoder belong to the first category. They can be used to create categorical variables, for example, to transform characters into numbers.

    import pandas as pd
    from sklearn import preprocessing

    # Test data
    df = pd.DataFrame(['A', 'B', 'B', 'C'], columns=['Col'])
    df['Fact'] = pd.factorize(df['Col'])[0]
    le = preprocessing.LabelEncoder()
    df['Lab'] = le.fit_transform(df['Col'])
    
    print(df)
    #   Col  Fact  Lab
    # 0   A     0    0
    # 1   B     1    1
    # 2   B     1    1
    # 3   C     2    2
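Both encodings are also reversible. pd.factorize returns the unique values alongside the codes, so indexing the uniques with the codes recovers the labels, and LabelEncoder offers inverse_transform. A small sketch with the same toy data:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# factorize returns the codes and the unique values in order of appearance.
codes, uniques = pd.factorize(['A', 'B', 'B', 'C'])
print(codes.tolist())           # [0, 1, 1, 2]
print(uniques[codes].tolist())  # ['A', 'B', 'B', 'C']

# LabelEncoder keeps its mapping, so the codes can be decoded again.
le = LabelEncoder()
lab = le.fit_transform(['A', 'B', 'B', 'C'])
print(le.inverse_transform(lab).tolist())  # ['A', 'B', 'B', 'C']
```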
    

    Encode categorical variable into dummy/indicator (binary) variables

    Pandas get_dummies and scikit-learn OneHotEncoder belong to the second category. They can be used to create binary variables. OneHotEncoder can only be used with categorical integers, while get_dummies can be used with other types of variables.

    import pandas as pd

    df = pd.DataFrame(['A', 'B', 'B', 'C'], columns=['Col'])
    df = pd.get_dummies(df)
    
    print(df)
    #    Col_A  Col_B  Col_C
    # 0    1.0    0.0    0.0
    # 1    0.0    1.0    0.0
    # 2    0.0    1.0    0.0
    # 3    0.0    0.0    1.0
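Because get_dummies accepts any input type, it can also be applied to a whole DataFrame: numeric columns pass through unchanged and only the string columns are expanded. A quick sketch (the 'Size' column here is an invented example):

```python
import pandas as pd

# A mixed DataFrame: one numeric column, one string column.
df = pd.DataFrame({'Size': [10, 20, 30], 'Col': ['A', 'B', 'A']})
out = pd.get_dummies(df)

# The numeric column is kept as-is; 'Col' becomes Col_A / Col_B.
print(out.columns.tolist())  # ['Size', 'Col_A', 'Col_B']
```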
    
    import pandas as pd
    from sklearn.preprocessing import OneHotEncoder, LabelEncoder

    df = pd.DataFrame(['A', 'B', 'B', 'C'], columns=['Col'])
    # We first need to transform the characters into integers in order to use the OneHotEncoder
    le = LabelEncoder()
    df['Col'] = le.fit_transform(df['Col'])
    enc = OneHotEncoder()
    df = pd.DataFrame(enc.fit_transform(df).toarray())
    
    print(df)
    #      0    1    2
    # 0  1.0  0.0  0.0
    # 1  0.0  1.0  0.0
    # 2  0.0  1.0  0.0
    # 3  0.0  0.0  1.0
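One version-dependent caveat to the answer above: since scikit-learn 0.20, OneHotEncoder can encode string categories directly, so the LabelEncoder detour is unnecessary on recent versions (treat this as an assumption about your installed version):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame(['A', 'B', 'B', 'C'], columns=['Col'])

# scikit-learn >= 0.20: strings are accepted without prior integer encoding.
enc = OneHotEncoder()
onehot = enc.fit_transform(df[['Col']]).toarray()
print(onehot.tolist())  # [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
```

On scikit-learn 1.0 and later, enc.get_feature_names_out() also recovers descriptive column names such as 'Col_A', which addresses the naming complaint in the question.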
    

    This concludes our article on the differences between pd.factorize, pd.get_dummies, sklearn.preprocessing.LabelEncoder and OneHotEncoder. We hope the recommended answer is helpful, and thank you for your support!

    10-22 08:45