This article explains how to obtain class probabilities from random forest predictions with PySpark & MLlib. It should be a useful reference for anyone facing the same problem.

Problem description

I'm trying to extract the class probabilities of a random forest object I have trained using PySpark. However, I do not see an example of it anywhere in the documentation, nor is it a method of RandomForestModel.

How can I extract class probabilities from a RandomForestModel classifier in PySpark?

Here's the sample code from the documentation; it returns only the final class, not the probability:

from pyspark.mllib.tree import RandomForest
from pyspark.mllib.util import MLUtils

# Load and parse the data file into an RDD of LabeledPoint.
data = MLUtils.loadLibSVMFile(sc, 'data/mllib/sample_libsvm_data.txt')
# Split the data into training and test sets (30% held out for testing)
(trainingData, testData) = data.randomSplit([0.7, 0.3])

# Train a RandomForest model.
#  Empty categoricalFeaturesInfo indicates all features are continuous.
#  Note: Use larger numTrees in practice.
#  Setting featureSubsetStrategy="auto" lets the algorithm choose.
model = RandomForest.trainClassifier(trainingData, numClasses=2, categoricalFeaturesInfo={},
                                     numTrees=3, featureSubsetStrategy="auto",
                                     impurity='gini', maxDepth=4, maxBins=32)

# Make predictions on the test instances
predictions = model.predict(testData.map(lambda x: x.features))
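
For illustration only (this is not part of the documentation sample), inspecting a few of these predictions shows that predict returns hard 0.0/1.0 class labels with no probabilities attached:

# Hypothetical check, reusing 'predictions' from the sample above: the RDD
# holds only hard class labels, e.g. [0.0, 1.0, 0.0, 0.0, 1.0].
print(predictions.take(5))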

I don't see any model.predict_proba() method - what should I do?

Recommended answer

As far as I can tell this is not supported in the current version (1.2.1). The Python wrapper over the native Scala code (tree.py) defines only 'predict' functions which, in turn, call the respective Scala counterparts (treeEnsembleModels.scala). The latter make decisions by taking a vote among binary decisions. A much cleaner solution would have been to provide a probabilistic prediction which can be thresholded arbitrarily or used for ROC computation like in sklearn. This feature should be added for future releases!

As a workaround, I implemented predict_proba as a pure Python function (see example below). It is neither elegant nor very efficient, as it runs a loop over the set of individual decision trees in a forest. The trick - or rather a dirty hack - is to access the array of Java decision tree models and cast them into Python counterparts. After that you can compute each individual model's predictions over the entire dataset and accumulate their sum in an RDD using 'zip'. Dividing by the number of trees gives the desired result. For large datasets, a loop over a small number of decision trees on the master node should be acceptable.
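
As a toy illustration of that zip-and-accumulate pattern (plain number RDDs standing in for per-tree predictions; 'sc' is assumed to be an existing SparkContext):

# Two RDDs of "votes", as two decision trees would produce over three samples.
a = sc.parallelize([1.0, 0.0, 1.0])
b = sc.parallelize([1.0, 1.0, 0.0])

# Pair them up element-wise, sum the votes, then divide by the number of trees.
summed = a.zip(b).map(lambda x: x[0] + x[1])
print(summed.map(lambda s: s / 2).collect())   # [1.0, 0.5, 0.5]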

The code below is rather tricky due to the difficulties of integrating Python with Spark (which runs on the JVM). One should be very careful not to send any complex data to worker nodes, as that results in crashes due to serialization problems. No code referring to the Spark context can be run on a worker node. Also, no code referring to any Java object can be serialized. For example, it may be tempting to use len(trees) instead of ntrees in the code below - bang! Writing such a wrapper in Java/Scala could be much more elegant, for example by running the loop over decision trees on the worker nodes and hence reducing communication costs.
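
The fragment below is a hedged sketch of that pitfall; the names refer to the predict_proba code that follows, and only the plain Python integer is safe to capture in a closure shipped to the workers:

# Illustrative fragment only (assumes rf_model and scores as in predict_proba below).
trees = rf_model._java_model.trees()   # py4j JavaArray -- a driver-side Java object
ntrees = rf_model.numTrees()           # plain Python int -- safe to capture

# BAD: the lambda would capture 'trees', a Java object that cannot be pickled:
# scores = scores.map(lambda s: s / len(trees))

# GOOD: only the plain integer is captured by the closure:
scores = scores.map(lambda s: s / ntrees)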

The test function below demonstrates that predict_proba gives the same test error as predict used in the original example.

from pyspark.mllib.tree import DecisionTreeModel  # needed for the cast below

def predict_proba(rf_model, data):
    '''
    This wrapper overcomes the "binary" nature of predictions in the native
    RandomForestModel.
    '''

    # Collect the individual decision tree models by calling the underlying
    # Java model. These are returned as a JavaArray defined by py4j.
    trees = rf_model._java_model.trees()
    ntrees = rf_model.numTrees()
    scores = DecisionTreeModel(trees[0]).predict(data.map(lambda x: x.features))

    # For each remaining decision tree, apply its prediction to the entire
    # dataset and accumulate the results using 'zip'.
    for i in range(1, ntrees):
        dtm = DecisionTreeModel(trees[i])
        scores = scores.zip(dtm.predict(data.map(lambda x: x.features)))
        scores = scores.map(lambda x: x[0] + x[1])

    # Divide the accumulated scores by the number of trees
    return scores.map(lambda x: x / ntrees)

def testError(lap):
    # 'testData' is taken from the enclosing scope (the documentation sample above).
    testErr = lap.filter(lambda vp: vp[0] != vp[1]).count() / float(testData.count())
    print('Test Error = ' + str(testErr))


def testClassification(trainingData, testData):

    model = RandomForest.trainClassifier(trainingData, numClasses=2,
                                         categoricalFeaturesInfo={},
                                         numTrees=50, maxDepth=30)

    # Compute test error by thresholding probabilistic predictions
    threshold = 0.5
    scores = predict_proba(model, testData)
    pred = scores.map(lambda x: 0 if x < threshold else 1)
    lab_pred = testData.map(lambda lp: lp.label).zip(pred)
    testError(lab_pred)

    # Compute test error by comparing binary predictions
    predictions = model.predict(testData.map(lambda x: x.features))
    labelsAndPredictions = testData.map(lambda lp: lp.label).zip(predictions)
    testError(labelsAndPredictions)
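
A minimal usage sketch (not part of the original answer), assuming 'sc', 'trainingData', 'testData' and the 3-tree 'model' from the documentation sample above are in scope; the AUC part additionally assumes pyspark.mllib.evaluation.BinaryClassificationMetrics, which only appeared in the Python API in releases after 1.2.1:

# Run the comparison of probabilistic vs. binary predictions.
testClassification(trainingData, testData)

# Optional: area under the ROC curve from the probabilistic scores.
from pyspark.mllib.evaluation import BinaryClassificationMetrics

scores = predict_proba(model, testData)
scoreAndLabels = scores.zip(testData.map(lambda lp: lp.label))
metrics = BinaryClassificationMetrics(scoreAndLabels)
print('AUC = ' + str(metrics.areaUnderROC))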

All-in-all, this was a nice exercise to learn Spark!

This concludes the article on class probabilities for random forest predictions with PySpark & MLlib. We hope the recommended answer helps.
