How to handle categorical features for decision trees and random forests in Spark ML?

Question

I am trying to build decision tree and random forest classifiers on the UCI bank marketing data (https://archive.ics.uci.edu/ml/datasets/bank+marketing). The data set has many categorical features with string values.

The Spark ML documentation mentions that categorical variables can be converted to numeric values by indexing with either StringIndexer or VectorIndexer. I chose StringIndexer, because VectorIndexer requires a vector column, and VectorAssembler, which builds that vector column, accepts only numeric types. With this approach, each level of a categorical feature is assigned a numeric value based on its frequency (0 for the most frequent label of the feature).
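The frequency-based indexing described above can be sketched in plain Python. This is only an illustration of the idea, not Spark's API or implementation: the function name `fit_string_index`, the sample values, and the alphabetical tie-breaking rule are assumptions made for this sketch.

```python
from collections import Counter


def fit_string_index(values):
    """Map each distinct label to a float index, most frequent label first."""
    counts = Counter(values)
    # Sort by descending frequency; ties are broken alphabetically
    # (an assumption of this sketch, chosen to make the result deterministic).
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


# Hypothetical sample of a string-valued categorical column.
marital = ["married", "single", "married", "divorced", "married", "single"]

mapping = fit_string_index(marital)
indexed = [mapping[v] for v in marital]

print(mapping)
print(indexed)
```

The most frequent label receives 0.0, so the resulting numbers reflect how often a label occurs, not any meaningful order among the labels.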

My question is how the random forest or decision tree algorithm will know that the new features (derived from categorical features) are different from continuous variables. Will an indexed feature be treated as continuous by the algorithm? Is this the right approach, or should I go ahead with One-Hot Encoding for the categorical features?

I have read some of the answers on this forum, but I am still not clear on the last part.

Recommended answer

One Hot Encoding should be done for categorical variables with more than 2 categories.

To understand why, you should know the difference between the two subcategories of categorical data: ordinal data and nominal data.

Ordinal data: the values have some sort of ordering between them. Example: customer feedback (excellent, good, neutral, bad, very bad). As you can see, there is a clear ordering between them (excellent > good > neutral > bad > very bad). In this case StringIndexer alone is sufficient for modelling purposes.

Nominal data: the values have no defined ordering between them. Example: colours (black, blue, white, ...). In this case StringIndexer alone is NOT sufficient, and One Hot Encoding is required after String Indexing.

After String Indexing, let's assume the output is:

 id | colour   | categoryIndex
----|----------|---------------
 0  | black    | 0.0
 1  | white    | 1.0
 2  | yellow   | 2.0
 3  | red      | 3.0

Then, without One Hot Encoding, an algorithm that reads categoryIndex as an ordinary number will assume red > yellow > white > black, which we know is not true. OneHotEncoder() will help us avoid this situation.
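As a rough illustration of why one-hot encoding removes the spurious order, here is a plain-Python sketch that uses the indices from the table above. The helper `one_hot` is hypothetical and is not Spark's OneHotEncoder. For clarity it keeps one slot per category, whereas Spark's encoder drops the last category by default (its `dropLast` parameter).

```python
from itertools import combinations


def one_hot(index, num_categories):
    """Return a list with 1.0 at position `index` and 0.0 elsewhere."""
    vec = [0.0] * num_categories
    vec[int(index)] = 1.0
    return vec


# Indices taken from the StringIndexer output table above.
colour_index = {"black": 0.0, "white": 1.0, "yellow": 2.0, "red": 3.0}
encoded = {c: one_hot(i, len(colour_index)) for c, i in colour_index.items()}


def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


# Gaps between raw indices differ per pair, which implies an order.
index_gaps = {(a, b): abs(colour_index[a] - colour_index[b])
              for a, b in combinations(colour_index, 2)}

# After one-hot encoding every pair of colours is equally far apart.
one_hot_gaps = {(a, b): squared_distance(encoded[a], encoded[b])
                for a, b in combinations(encoded, 2)}

print(encoded["red"])
print(sorted(set(index_gaps.values())))
print(sorted(set(one_hot_gaps.values())))
```

With raw indices, black and red are three units apart while black and white are one unit apart. After encoding, no colour is "larger" or "closer" than any other.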

So, to answer your questions:

It will be considered a continuous variable.

It depends on your understanding of the data. Although Random Forest and some boosting methods don't require One Hot Encoding, most ML algorithms need it.
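To make the "considered continuous" point concrete, the following plain-Python sketch compares the candidate splits a tree node has when the colour index is read as a number with the splits available when the values are treated as an unordered set. It is a conceptual illustration only. The function names are hypothetical, and it makes no claim about how Spark's tree implementation enumerates splits.

```python
from itertools import combinations

# Categories listed in index order 0..3, as in the table above.
categories = ["black", "white", "yellow", "red"]


def threshold_splits(cats):
    """Splits of the form index <= t: only contiguous index ranges."""
    return [(set(cats[:t + 1]), set(cats[t + 1:]))
            for t in range(len(cats) - 1)]


def subset_splits(cats):
    """All two-way partitions of an unordered category set."""
    first, rest = cats[0], cats[1:]
    splits = []
    # Pin the first category to the left side so each partition is counted
    # once; r stops short of len(rest) so the right side is never empty.
    for r in range(len(rest)):
        for combo in combinations(rest, r):
            left = {first, *combo}
            splits.append((left, set(cats) - left))
    return splits


numeric = threshold_splits(categories)
unordered = subset_splits(categories)
target = ({"black", "red"}, {"white", "yellow"})

print(len(numeric), len(unordered))
print(target in numeric, target in unordered)
```

A single numeric threshold cannot put black and red on one side and white and yellow on the other, so a tree that sees only the index needs several splits to express that grouping.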

Reference: https://spark.apache.org/docs/Latest/ml-features.html#onehotencoder

