Problem description
I have a DataFrame with some categorical string values (e.g. uuid|url|browser).
I would like to convert them to doubles so that I can run an ML algorithm that accepts a matrix of doubles.
As the conversion method I used StringIndexer (Spark 1.4), which maps my string values to double values, so I defined a function like this:
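import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.sql.DataFrame

// Adds a "<arg>_index" column holding the double index of each string value in column arg.
def str(arg: String, df: DataFrame): DataFrame = {
  val indexer = new StringIndexer().setInputCol(arg).setOutputCol(arg + "_index")
  indexer.fit(df).transform(df)
}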
Now the issue is that I would like to iterate over each column of a DataFrame, call this function, and add the parsed double column alongside the original string column, so the result would be:
Initial DF:
[String: uuid|String: url| String: browser]
Final DF:
[String: uuid|Double: uuid_index|String: url|Double: url_index|String: browser|Double: browser_index]
Thanks in advance.
Recommended answer
You can simply foldLeft over the Array of columns:
val transformed: DataFrame = df.columns.foldLeft(df)((df, arg) => str(arg, df))
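To make the fold easier to follow, here is a minimal plain-Scala sketch that needs no Spark. It is an illustration, not Spark's implementation: Table, indexColumn and indexAll are hypothetical names, the table is just a Map from column name to values, and breaking frequency ties alphabetically is an assumption made here. Like df.columns in the line above, the column list is computed once from the original table, so the newly added index columns are not indexed again.

```scala
object FoldLeftSketch {
  // Hypothetical stand-in for a DataFrame: column name -> column values.
  type Table = Map[String, Vector[Any]]

  // Hypothetical stand-in for str: adds "<col>_index" next to the original column.
  def indexColumn(col: String, table: Table): Table = {
    val values = table(col).map(_.toString)
    // Most frequent label gets 0.0; ties are broken alphabetically (an assumption of this sketch).
    val labels = values
      .groupBy(identity)
      .toSeq
      .sortBy { case (label, occurrences) => (-occurrences.size, label) }
      .map(_._1)
    val index = labels.zipWithIndex.map { case (label, i) => label -> i.toDouble }.toMap
    table + (s"${col}_index" -> values.map(index))
  }

  // The accumulator is the table built so far; each step adds one index column.
  def indexAll(table: Table): Table =
    table.keys.toSeq.foldLeft(table)((acc, col) => indexColumn(col, acc))
}
```

Each step receives the table produced by the previous step, which is why the final result contains every original column plus one index column per original column.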
Still, I would argue that this is not a good approach. Since str discards the StringIndexerModel, the fitted mapping cannot be reused when you get new data. Because of that I would recommend using Pipeline:
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.StringIndexer
val transformers: Array[org.apache.spark.ml.PipelineStage] = df.columns.map(
cname => new StringIndexer()
.setInputCol(cname)
.setOutputCol(s"${cname}_index")
)
// Add the rest of your pipeline like VectorAssembler and algorithm
val stages: Array[org.apache.spark.ml.PipelineStage] = transformers ++ ???
val pipeline = new Pipeline().setStages(stages)
val model = pipeline.fit(df)
Edit
VectorAssembler can be included as follows:
import org.apache.spark.ml.feature.VectorAssembler
val assembler = new VectorAssembler()
.setInputCols(df.columns.map(cname => s"${cname}_index"))
.setOutputCol("features")
val stages = transformers :+ assembler
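Putting the pieces together, the indexers and the assembler could be wrapped in one helper that takes the column names. This is only a sketch under assumptions: IndexingPipeline, build and indexed are hypothetical names, every column is assumed to be a string column worth indexing, and no learning algorithm is appended after the assembler.

```scala
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}

object IndexingPipeline {
  // Output column name for an indexed column, using the same suffix as above.
  private def indexed(cname: String): String = s"${cname}_index"

  // Builds an unfitted Pipeline: one StringIndexer per column, then a VectorAssembler.
  def build(columns: Array[String]): Pipeline = {
    val indexers: Array[PipelineStage] = columns.map { cname =>
      new StringIndexer().setInputCol(cname).setOutputCol(indexed(cname))
    }
    val assembler = new VectorAssembler()
      .setInputCols(columns.map(indexed))
      .setOutputCol("features")
    new Pipeline().setStages(indexers :+ assembler)
  }
}

// Usage, assuming df is the DataFrame from the question:
//   val model = IndexingPipeline.build(df.columns).fit(df)
//   val withFeatures = model.transform(df)
```

Because fit returns a PipelineModel that holds the fitted indexers, the same model can later be applied to new data with the same columns, which is the advantage over calling str column by column.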