Problem description
Say I have the following input feature:
hotel_id = [1, 2, 3, 2, 3]
This is a categorical feature with numeric values. If I give it to the model as it is, the model will treat it as a continuous variable, i.e., it will assume 2 > 1.
If I apply sklearn.preprocessing.LabelEncoder, I get:
hotel_id = [0, 1, 2, 1, 2]
So is this encoded feature treated as continuous or categorical? If it is treated as continuous, what is the use of LabelEncoder?
P.S. I know about one-hot encoding, but there are around 100 hotel_ids, so I don't want to use it. Thanks.
Recommended answer
LabelEncoder is a way to encode class levels as integers. In addition to the integer example you've included, consider the following example:
>>> from sklearn.preprocessing import LabelEncoder
>>> le = LabelEncoder()
>>>
>>> train = ["paris", "paris", "tokyo", "amsterdam"]
>>> test = ["tokyo", "tokyo", "paris"]
>>> le.fit(train).transform(test)
array([2, 2, 1]...)
What LabelEncoder allows us to do, then, is map each category to an integer code. The codes follow the sorted order of the labels, not any meaningful ranking. However, what you've noted is correct: a model will treat the resulting [2, 2, 1] as numeric data. This is a good candidate for using OneHotEncoder to create dummy variables (which I know you said you were hoping not to use).
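As a rough illustration of the dummy-variable route, here is a minimal sketch applying OneHotEncoder to the hotel_id values from the question. The variable names are made up for the example. The encoder returns a sparse matrix by default, which keeps memory use modest even with around 100 distinct ids.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# hotel_id values from the question, reshaped to a single feature column
hotel_id = np.array([1, 2, 3, 2, 3]).reshape(-1, 1)

# handle_unknown="ignore" encodes unseen ids as an all-zero row instead of raising
encoder = OneHotEncoder(handle_unknown="ignore")
one_hot = encoder.fit_transform(hotel_id)  # sparse matrix by default

# Convert to a dense array only for inspection; keep it sparse for real data
dense = one_hot.toarray()
print(encoder.categories_)
print(dense)
```

Each row has exactly one non-zero entry, so no ordering between hotel ids is implied.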
Note that in older scikit-learn releases (before 0.20), OneHotEncoder accepted only integer input and could not handle string categories, so LabelEncoder had to be applied first. That is why it is frequently used as a precursor to one-hot encoding. Newer releases of OneHotEncoder accept string categories directly.
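For what it's worth, in recent scikit-learn releases (0.20 and later, to the best of my knowledge) OneHotEncoder can be fitted on string categories directly, so the intermediate LabelEncoder step may not be needed there. A small sketch, assuming such a version is installed and reusing the city names from the example above:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Same city labels as in the example above, as a single feature column
train = np.array(["paris", "paris", "tokyo", "amsterdam"]).reshape(-1, 1)
test = np.array(["tokyo", "tokyo", "paris"]).reshape(-1, 1)

# Assumes scikit-learn >= 0.20, where string input is accepted
ohe = OneHotEncoder(handle_unknown="ignore")
ohe.fit(train)

# Columns follow the sorted categories: amsterdam, paris, tokyo
encoded = ohe.transform(test).toarray()
print(encoded)
```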
Alternatively, it can encode your target into a usable array. If, for instance, train were your target for classification, you would use a LabelEncoder to turn it into your y variable.
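A minimal sketch of that target-encoding use, treating the train list from the earlier example as class labels. The names y_raw, y and decoded are only illustrative. inverse_transform maps the integer codes (for example, a classifier's predictions) back to the original labels.

```python
from sklearn.preprocessing import LabelEncoder

# Class labels for a classification task (the `train` list from above)
y_raw = ["paris", "paris", "tokyo", "amsterdam"]

le = LabelEncoder()
y = le.fit_transform(y_raw)  # integer codes usable as the y argument of a classifier

# classes_ holds the sorted labels; a label's index is its code
classes = list(le.classes_)

# Map integer codes back to the original string labels
decoded = list(le.inverse_transform(y))
print(y, classes, decoded)
```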