

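I want to tune my model with grid search and cross-validation in Spark. In Spark, the base model has to be put in a Pipeline. The official Pipeline demo uses LogisticRegression as the base model, which can be created as an object with new. However, the RandomForest model cannot be instantiated with new from client code, so it seems impossible to use RandomForest with the Pipeline API. I don't want to reinvent the wheel, so can anyone give some advice?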


Well, that is true, but you are simply trying to use the wrong class. Instead of mllib.tree.RandomForest you should use ml.classification.RandomForestClassifier. Here is an example based on the one from the MLlib docs.

import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.util.MLUtils
import sqlContext.implicits._

case class Record(category: String, features: Vector)

val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
val splits = data.randomSplit(Array(0.7, 0.3))
val (trainData, testData) = (splits(0), splits(1))

val trainDF = trainData.map(lp => Record(lp.label.toString, lp.features)).toDF
val testDF = testData.map(lp => Record(lp.label.toString, lp.features)).toDF

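// Index the string "category" column into a numeric "label" column;
// StringIndexer has no default input column, so it must be set explicitly.
val indexer = new StringIndexer()
    .setInputCol("category")
    .setOutputCol("label")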

val rf  = new RandomForestClassifier()

val pipeline = new Pipeline()
    .setStages(Array(indexer, rf))

val model = pipeline.fit(trainDF)


There is one thing I couldn't figure out here. As far as I can tell, it should be possible to use labels extracted from LabeledPoints directly, but for some reason it doesn't work and pipeline.fit raises an IllegalArgumentException.

Hence the ugly trick with StringIndexer. After applying it we get the required attributes ({"vals":["1.0","0.0"],"type":"nominal","name":"label"}), but some classes in ml seem to work just fine without them.
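
Since the question was about grid search and cross-validation, here is a minimal sketch of how the same kind of pipeline could be handed to CrossValidator. It assumes a spark-shell session on Spark 1.5 or later in the 1.x line (so that sc and sqlContext exist and MulticlassClassificationEvaluator is available). The grid values and the number of folds are arbitrary illustrations, not recommendations.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.util.MLUtils
import sqlContext.implicits._

case class Record(category: String, features: Vector)

// Same sample data and Record layout as in the example above.
val trainDF = MLUtils
  .loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
  .map(lp => Record(lp.label.toString, lp.features))
  .toDF

val indexer = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("label")

val rf = new RandomForestClassifier()

val pipeline = new Pipeline()
  .setStages(Array(indexer, rf))

// Candidate hyperparameters (assumed values): 2 x 2 = 4 combinations.
val paramGrid = new ParamGridBuilder()
  .addGrid(rf.numTrees, Array(3, 10))
  .addGrid(rf.maxDepth, Array(2, 4))
  .build()

// The whole pipeline is the estimator, so indexing is refit on every fold.
val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(new MulticlassClassificationEvaluator())
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(3)

val cvModel = cv.fit(trainDF)
```

The fitted cvModel can then be used like any other model, e.g. cvModel.transform(testDF), and cvModel.bestModel holds the pipeline refit with the best-scoring parameter combination.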


08-13 18:50