本文介绍了标记列,在范畴星火的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我目前使用StringIndexer到很多列转换成在RandomForestModel分类独特的整数。我还使用了ML过程中的管道。

有些查询是


  1. 如何在RandomForestModel知道哪些列分类。 StringIndexer转换非 - 数值对数值,但是它添加somesort的一些元数据以表明它是一个类别列?在mllib.tree.RF有参数调用categoricalInfo这表明它们是绝对列。 ml.tree.RF如何知道这是因为这不是present。


  2. 此外,StringIndexer映射类别的基础上OCCURENCES频率整数。现在,当新的数据进来,我该如何确保该数据连接codeD始终与训练数据?我可能坐Ø做,没有StringIndexing整个数据再次,包括新的数据?


我很困惑就如何落实这一点。


解决方案

Yes, it is possible. You just have to use an indexer fitted on a training data. If you use ML pipelines it will be handled for you just use StringIndexerModel directly:

import org.apache.spark.ml.feature.StringIndexer

val train = sc.parallelize(Seq((1, "a"), (2, "a"), (3, "b"))).toDF("x", "y")
val test  = sc.parallelize(Seq((1, "a"), (2, "b"), (3, "b"))).toDF("x", "y")

val indexer = new StringIndexer()
  .setInputCol("y")
  .setOutputCol("y_index")
  .fit(train)

indexer.transform(train).show

// +---+---+-------+
// |  x|  y|y_index|
// +---+---+-------+
// |  1|  a|    0.0|
// |  2|  a|    0.0|
// |  3|  b|    1.0|
// +---+---+-------+

indexer.transform(test).show

// +---+---+-------+
// |  x|  y|y_index|
// +---+---+-------+
// |  1|  a|    0.0|
// |  2|  b|    1.0|
// |  3|  b|    1.0|
// +---+---+-------+

One possible caveat is that it doesn't handle gracefully unseen labels so you have to drop these before transforming.

Different ML transformers add specialspecial metadata to the transformed columns which indicate type of the column, number of classes, etc.

import org.apache.spark.ml.attribute._
import org.apache.spark.ml.feature.VectorAssembler

val assembler = new VectorAssembler()
  .setInputCols(Array("x", "y_index"))
  .setOutputCol("features")

val transformed = assembler.transform(indexer.transform(train))
val meta = AttributeGroup.fromStructField(transformed.schema("features"))
meta.attributes.get

// Array[org.apache.spark.ml.attribute.Attribute] = Array(
//   {"type":"numeric","idx":0,"name":"x"},
//   {"vals":["a","b"],"type":"nominal","idx":1,"name":"y_index"})

or

transformed.select($"features").schema.fields.last.metadata
// "ml_attr":{"attrs":{"numeric":[{"idx":0,"name":"x"}], 
//  "nominal":[{"vals":["a","b"],"idx":1,"name":"y_index"}]},"num_attrs":2}}

这篇关于标记列,在范畴星火的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!

10-27 22:37