Problem description
I want to tune my model with grid search and cross-validation in Spark. In Spark, the base model has to be placed in a Pipeline; the official Pipeline demo uses LogisticRegression
as the base model, which can be instantiated with new. However, the RandomForest
model cannot be created with new in client code, so it seems impossible to use RandomForest
with the Pipeline API. I don't want to reinvent the wheel, so can anyone give some advice?
Thanks
Well, that is true, but you are simply trying to use the wrong class. Instead of mllib.tree.RandomForest
you should use ml.classification.RandomForestClassifier
. Here is an example based on the one from the MLlib docs.
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.util.MLUtils
import sqlContext.implicits._

case class Record(category: String, features: Vector)

// Load the data and split it into training and test sets
val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
val splits = data.randomSplit(Array(0.7, 0.3))
val (trainData, testData) = (splits(0), splits(1))

// Convert the RDDs of LabeledPoints to DataFrames
val trainDF = trainData.map(lp => Record(lp.label.toString, lp.features)).toDF
val testDF = testData.map(lp => Record(lp.label.toString, lp.features)).toDF

// Index the string labels so the classifier gets the label metadata it needs
val indexer = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("label")

val rf = new RandomForestClassifier()
  .setNumTrees(3)
  .setFeatureSubsetStrategy("auto")
  .setImpurity("gini")
  .setMaxDepth(4)
  .setMaxBins(32)

// Chain the indexer and the classifier into a single Pipeline
val pipeline = new Pipeline()
  .setStages(Array(indexer, rf))

val model = pipeline.fit(trainDF)
model.transform(testDF)
There is one thing I couldn't figure out here. As far as I can tell, it should be possible to use labels extracted from LabeledPoints
directly, but for some reason it doesn't work, and pipeline.fit
raises an IllegalArgumentException
.
Hence the ugly trick with StringIndexer
. After applying it we get the required attributes ({"vals":["1.0","0.0"],"type":"nominal","name":"label"}
), although some classes in ml
seem to work just fine without it.
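Since the original question asked about grid search with cross-validation, the pipeline above can be tuned with Spark's ml.tuning API. This is a minimal sketch, assuming a Spark version that ships CrossValidator, ParamGridBuilder, and MulticlassClassificationEvaluator (1.5+); the grid values are illustrative, not recommendations:

```scala
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator

// Build a grid of RandomForestClassifier hyperparameters to search over
// (the specific values here are just examples)
val paramGrid = new ParamGridBuilder()
  .addGrid(rf.maxDepth, Array(4, 8))
  .addGrid(rf.numTrees, Array(3, 10))
  .build()

// Cross-validate the whole pipeline, so the StringIndexer is
// refitted on each training fold along with the classifier
val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(new MulticlassClassificationEvaluator().setLabelCol("label"))
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(3)

val cvModel = cv.fit(trainDF)
cvModel.transform(testDF)
```

Passing the whole pipeline as the estimator (rather than just the classifier) keeps the label indexing inside each fold, which avoids leaking test-fold information into the metadata.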