This article looks at how to retrieve class probabilities from a random forest model in Spark 1.5.1 MLlib; hopefully it is a useful reference for anyone facing the same problem.

Problem description

I am using Spark 1.5.1 with MLlib. I built a random forest model with MLlib and am now using the model for prediction. I can find the predicted category (0.0 or 1.0) with the .predict function, but I cannot find a function that retrieves the probability (see the attached screenshot). I thought the Spark 1.5.1 random forest would provide the probability; am I missing something here?
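For context, here is a minimal sketch of the RDD-based MLlib API the question is using; as observed, RandomForestModel.predict returns only the predicted label, and no companion method exposes the class probabilities (the file path and hyperparameters below are illustrative):

import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.util.MLUtils

// Load the data in the old RDD[LabeledPoint] form.
val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")

// Train with the RDD-based API (illustrative hyperparameters).
val model = RandomForest.trainClassifier(data, numClasses = 2,
  categoricalFeaturesInfo = Map[Int, Int](), numTrees = 10,
  featureSubsetStrategy = "auto", impurity = "gini",
  maxDepth = 4, maxBins = 32)

// predict returns only the class label (0.0 or 1.0); there is no
// method on this model that returns the underlying probabilities.
val label: Double = model.predict(data.first().features)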

Recommended answer

Unfortunately, this feature is not available in the older Spark MLlib 1.5.1.
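If you must stay on 1.5.1, one common workaround (a sketch under that assumption, not part of the original answer) is to approximate the probability from the votes of the individual trees, which the model exposes via model.trees:

import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.tree.model.RandomForestModel

// Approximate P(class = 1.0) as the fraction of trees voting for it.
// model.trees exposes the forest's individual DecisionTreeModels.
def positiveProbability(model: RandomForestModel, features: Vector): Double = {
  val votes = model.trees.map(_.predict(features))
  votes.count(_ == 1.0).toDouble / votes.length
}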

You can, however, find it in the newer Pipeline API in Spark MLlib 2.x as RandomForestClassifier:

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.feature.{IndexToString, StringIndexer, VectorIndexer}
import org.apache.spark.mllib.util.MLUtils

// Load and parse the data file, converting it to a DataFrame.
val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt").toDF

// Index labels, adding metadata to the label column.
// Fit on whole dataset to include all labels in index.
val labelIndexer = new StringIndexer()
  .setInputCol("label")
  .setOutputCol("indexedLabel").fit(data)

// Automatically identify categorical features, and index them.
// Set maxCategories so features with > 4 distinct values are treated as continuous.
val featureIndexer = new VectorIndexer()
  .setInputCol("features")
  .setOutputCol("indexedFeatures")
  .setMaxCategories(4).fit(data)

// Split the data into training and test sets (30% held out for testing)
val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3))

// Train a RandomForest model.
val rf = new RandomForestClassifier()
  .setLabelCol(labelIndexer.getOutputCol)
  .setFeaturesCol(featureIndexer.getOutputCol)
  .setNumTrees(10)

// Convert indexed labels back to original labels.
val labelConverter = new IndexToString()
  .setInputCol("prediction")
  .setOutputCol("predictedLabel")
  .setLabels(labelIndexer.labels)

// Chain indexers and forest in a Pipeline
val pipeline = new Pipeline()
  .setStages(Array(labelIndexer, featureIndexer, rf, labelConverter))

// Fit model. This also runs the indexers.
val model = pipeline.fit(trainingData)

// Make predictions.
val predictions = model.transform(testData)
// predictions: org.apache.spark.sql.DataFrame = [label: double, features: vector, indexedLabel: double, indexedFeatures: vector, rawPrediction: vector, probability: vector, prediction: double, predictedLabel: string]

predictions.show(10)
// +-----+--------------------+------------+--------------------+-------------+-----------+----------+--------------+
// |label|            features|indexedLabel|     indexedFeatures|rawPrediction|probability|prediction|predictedLabel|
// +-----+--------------------+------------+--------------------+-------------+-----------+----------+--------------+
// |  0.0|(692,[124,125,126...|         1.0|(692,[124,125,126...|   [0.0,10.0]|  [0.0,1.0]|       1.0|           0.0|
// |  0.0|(692,[124,125,126...|         1.0|(692,[124,125,126...|    [1.0,9.0]|  [0.1,0.9]|       1.0|           0.0|
// |  0.0|(692,[129,130,131...|         1.0|(692,[129,130,131...|    [1.0,9.0]|  [0.1,0.9]|       1.0|           0.0|
// |  0.0|(692,[154,155,156...|         1.0|(692,[154,155,156...|    [1.0,9.0]|  [0.1,0.9]|       1.0|           0.0|
// |  0.0|(692,[154,155,156...|         1.0|(692,[154,155,156...|    [1.0,9.0]|  [0.1,0.9]|       1.0|           0.0|
// |  0.0|(692,[181,182,183...|         1.0|(692,[181,182,183...|    [1.0,9.0]|  [0.1,0.9]|       1.0|           0.0|
// |  1.0|(692,[99,100,101,...|         0.0|(692,[99,100,101,...|    [4.0,6.0]|  [0.4,0.6]|       1.0|           0.0|
// |  1.0|(692,[123,124,125...|         0.0|(692,[123,124,125...|   [10.0,0.0]|  [1.0,0.0]|       0.0|           1.0|
// |  1.0|(692,[124,125,126...|         0.0|(692,[124,125,126...|   [10.0,0.0]|  [1.0,0.0]|       0.0|           1.0|
// |  1.0|(692,[125,126,127...|         0.0|(692,[125,126,127...|   [10.0,0.0]|  [1.0,0.0]|       0.0|           1.0|
// +-----+--------------------+------------+--------------------+-------------+-----------+----------+--------------+
// only showing top 10 rows

Note: this example is taken from the official Spark MLlib documentation, ML - Random forest classifier.

Here is a brief explanation of some of the output columns:

  • predictionCol holds the predicted label.
  • rawPredictionCol holds a Vector of length equal to the number of classes, containing the counts of training-instance labels at the tree nodes that make the prediction (classification only).
  • probabilityCol holds the probability Vector of the same length, obtained by normalizing rawPrediction to a multinomial distribution (classification only). For example, the rawPrediction [1.0, 9.0] in the output above (1 of 10 trees voted for class 0) normalizes to the probability [0.1, 0.9]. A sketch of reading a single class probability out of this column follows below.
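To pull a single class probability out of the probability column, you can operate on the Vector directly. A minimal sketch (the positiveProb UDF and the p_class1 column name are just for illustration):

import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.functions.{col, udf}

// UDF that reads the probability of class index 1 out of the Vector.
val positiveProb = udf((v: Vector) => v(1))

predictions
  .select(col("predictedLabel"), col("probability"),
    positiveProb(col("probability")).as("p_class1"))
  .show(5)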

That concludes this look at random forest probabilities in Spark 1.5.1 MLlib; hopefully the recommended answer is helpful.
