This article explains the differences between pd.factorize, pd.get_dummies, sklearn.preprocessing.LabelEncoder and OneHotEncoder. It should be a useful reference for anyone facing the same question, so read on to learn more.

Problem Description



    All four functions seem very similar to me. In some situations some of them might give the same result, and in others not. Any help would be greatly appreciated!

    Now I know, and I assume, that internally factorize and LabelEncoder work the same way and have no big differences in terms of results. I am not sure whether they take a similar amount of time on large amounts of data.

    get_dummies and OneHotEncoder will yield the same result, but OneHotEncoder can only handle numbers, whereas get_dummies accepts all kinds of input. get_dummies also generates new column names automatically for each input column, while OneHotEncoder does not (it assigns generic column names 1, 2, 3, ...). So get_dummies seems better in all respects.

    Please correct me if I am wrong! Thank you!

    Solution

    These four encoders can be split in two categories:

    • Encode labels into categorical variables: Pandas factorize and scikit-learn LabelEncoder. The result will have 1 dimension.
    • Encode a categorical variable into dummy/indicator (binary) variables: Pandas get_dummies and scikit-learn OneHotEncoder. The result will have n dimensions, one per distinct value of the encoded categorical variable.

    The main difference between the pandas and scikit-learn encoders is that the scikit-learn encoders are designed to be used in scikit-learn pipelines, with fit and transform methods.
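A minimal sketch of what that means in practice (the toy values below are invented for illustration): a scikit-learn encoder learns its mapping once with fit and can then replay the same mapping on unseen data with transform, which is exactly the train/test workflow pipelines rely on.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Learn the label mapping from the training data only.
train = pd.Series(['A', 'B', 'C'])
new = pd.Series(['C', 'A'])

le = LabelEncoder()
le.fit(train)

# Reuse the learned mapping on new data: 'C' -> 2, 'A' -> 0.
print(le.transform(new))    # [2 0]
```

pd.factorize, by contrast, offers no built-in way to replay the same mapping on a second dataset.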

    Encode labels into categorical variables

    Pandas factorize and scikit-learn LabelEncoder belong to the first category. They can be used to create categorical variables, for example, to transform characters into numbers.

    import pandas as pd
    from sklearn import preprocessing

    # Test data
    df = pd.DataFrame(['A', 'B', 'B', 'C'], columns=['Col'])
    df['Fact'] = pd.factorize(df['Col'])[0]
    le = preprocessing.LabelEncoder()
    df['Lab'] = le.fit_transform(df['Col'])
    
    print(df)
    #   Col  Fact  Lab
    # 0   A     0    0
    # 1   B     1    1
    # 2   B     1    1
    # 3   C     2    2
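Both encodings are also reversible. pd.factorize returns the unique values alongside the codes, so indexing the uniques with the codes recovers the labels, and LabelEncoder offers inverse_transform. A small sketch with the same toy data:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# factorize returns the codes and the unique values in order of appearance.
codes, uniques = pd.factorize(['A', 'B', 'B', 'C'])
print(codes.tolist())           # [0, 1, 1, 2]
print(uniques[codes].tolist())  # ['A', 'B', 'B', 'C']

# LabelEncoder keeps its mapping, so the codes can be decoded again.
le = LabelEncoder()
lab = le.fit_transform(['A', 'B', 'B', 'C'])
print(le.inverse_transform(lab).tolist())  # ['A', 'B', 'B', 'C']
```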
    

    Encode categorical variable into dummy/indicator (binary) variables

    Pandas get_dummies and scikit-learn OneHotEncoder belong to the second category. They can be used to create binary variables. OneHotEncoder can only be used with categorical integers, while get_dummies can be used with other types of variables.

    import pandas as pd

    df = pd.DataFrame(['A', 'B', 'B', 'C'], columns=['Col'])
    df = pd.get_dummies(df)
    
    print(df)
    #    Col_A  Col_B  Col_C
    # 0    1.0    0.0    0.0
    # 1    0.0    1.0    0.0
    # 2    0.0    1.0    0.0
    # 3    0.0    0.0    1.0
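Because get_dummies accepts any input type, it can also be applied to a whole DataFrame: numeric columns pass through unchanged and only the string columns are expanded. A quick sketch (the 'Size' column here is an invented example):

```python
import pandas as pd

# A mixed DataFrame: one numeric column, one string column.
df = pd.DataFrame({'Size': [10, 20, 30], 'Col': ['A', 'B', 'A']})
out = pd.get_dummies(df)

# The numeric column is kept as-is; 'Col' becomes Col_A / Col_B.
print(out.columns.tolist())  # ['Size', 'Col_A', 'Col_B']
```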
    
    import pandas as pd
    from sklearn.preprocessing import OneHotEncoder, LabelEncoder

    df = pd.DataFrame(['A', 'B', 'B', 'C'], columns=['Col'])
    # We first need to transform the characters into integers in order to use the OneHotEncoder
    le = LabelEncoder()
    df['Col'] = le.fit_transform(df['Col'])
    enc = OneHotEncoder()
    df = pd.DataFrame(enc.fit_transform(df).toarray())
    
    print(df)
    #      0    1    2
    # 0  1.0  0.0  0.0
    # 1  0.0  1.0  0.0
    # 2  0.0  1.0  0.0
    # 3  0.0  0.0  1.0
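One version-dependent caveat to the answer above: since scikit-learn 0.20, OneHotEncoder can encode string categories directly, so the LabelEncoder detour is unnecessary on recent versions (treat this as an assumption about your installed version):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame(['A', 'B', 'B', 'C'], columns=['Col'])

# scikit-learn >= 0.20: strings are accepted without prior integer encoding.
enc = OneHotEncoder()
onehot = enc.fit_transform(df[['Col']]).toarray()
print(onehot.tolist())  # [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
```

On scikit-learn 1.0 and later, enc.get_feature_names_out() also recovers descriptive column names such as 'Col_A', which addresses the naming complaint in the question.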
    

    This concludes our article on the differences between pd.factorize, pd.get_dummies, sklearn.preprocessing.LabelEncoder and OneHotEncoder. We hope the recommended answer is helpful, and thank you for your support!

    10-22 08:45