PySpark & MLlib: class probabilities of a random forest prediction

Problem description

I'm trying to extract the class probabilities of a random forest object I have trained using PySpark. However, I do not see an example of it anywhere in the documentation, nor is it a method of RandomForestModel.

How can I extract class probabilities from a RandomForestModel classifier in PySpark?

Here's the sample code from the documentation; it yields only the final class, not the probability:

from pyspark.mllib.tree import RandomForest
from pyspark.mllib.util import MLUtils

# Load and parse the data file into an RDD of LabeledPoint.
data = MLUtils.loadLibSVMFile(sc, 'data/mllib/sample_libsvm_data.txt')
# Split the data into training and test sets (30% held out for testing)
(trainingData, testData) = data.randomSplit([0.7, 0.3])

# Train a RandomForest model.
#  Empty categoricalFeaturesInfo indicates all features are continuous.
#  Note: Use larger numTrees in practice.
#  Setting featureSubsetStrategy="auto" lets the algorithm choose.
model = RandomForest.trainClassifier(trainingData, numClasses=2, categoricalFeaturesInfo={},
                                     numTrees=3, featureSubsetStrategy="auto",
                                     impurity='gini', maxDepth=4, maxBins=32)

# Evaluate model on test instances and compute test error
predictions = model.predict(testData.map(lambda x: x.features))

I don't see any model.predict_proba() method -- what should I do??
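
For contrast, the method the question is after does exist in scikit-learn. A minimal sketch of that behavior (X_train, y_train and X_test are placeholder arrays, not part of the original question):

# Hedged illustration: scikit-learn's random forest exposes predict_proba,
# returning one probability per class for every sample.
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=3)
clf.fit(X_train, y_train)           # placeholder training data
proba = clf.predict_proba(X_test)   # array of shape (n_samples, n_classes)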

Recommended answer

As far as I can tell, this is not supported in the current version (1.2.1). The Python wrapper over the native Scala code (tree.py) defines only 'predict' functions which, in turn, call the respective Scala counterparts (treeEnsembleModels.scala). The latter make decisions by taking a vote among the binary decisions of the individual trees. A much cleaner solution would have been a probabilistic prediction that can be thresholded arbitrarily or used for ROC computation, as in sklearn. This feature should be added in a future release!
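
To make the voting mechanics concrete, here is a conceptual sketch (plain Python; tree_votes is a placeholder list, not part of the original answer): for a binary problem each tree casts a 0/1 vote, and averaging the votes yields a probability estimate that can then be thresholded.

# Conceptual sketch: the fraction of trees voting for class 1 is an
# estimate of P(class = 1 | x).
tree_votes = [1, 0, 1]  # placeholder: one 0/1 vote per tree in the forest
p_class1 = sum(tree_votes) / float(len(tree_votes))  # 0.666...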

As a workaround, I implemented predict_proba as a pure Python function (see the example below). It is neither elegant nor very efficient, as it runs a loop over the individual decision trees in the forest. The trick - or rather a dirty hack - is to access the array of Java decision tree models and cast them to their Python counterparts. After that you can compute each individual model's predictions over the entire dataset and accumulate their sum in an RDD using 'zip'. Dividing by the number of trees gives the desired result. For large datasets, a loop over a small number of decision trees on the master node should be acceptable.

The code below is rather tricky due to the difficulties of integrating Python with Spark (which runs on the JVM). One has to be very careful not to send any complex data to the worker nodes, as that results in crashes due to serialization problems: no code referring to the Spark context can run on a worker node, and nothing that references a Java object can be serialized. For example, it may be tempting to use len(trees) instead of ntrees in the code below - bang! A wrapper like this would be much more elegant in Java/Scala, for example by running the loop over the decision trees on the worker nodes and hence reducing communication costs.
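
To make that pitfall concrete, here is a sketch of the unsafe and safe patterns, reusing the names trees, rf_model and scores from the predict_proba function below:

# DON'T: 'trees' is a py4j JavaArray living on the driver; a lambda that
# references it (even just via len()) cannot be pickled for the workers.
# scores.map(lambda x: x / len(trees))   # bang! serialization error

# DO: capture a plain Python int on the driver and close over that instead.
ntrees = rf_model.numTrees()
scores = scores.map(lambda x: x / ntrees)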

The test function below demonstrates that predict_proba gives the same test error as the predict used in the original example.

from pyspark.mllib.tree import DecisionTreeModel

def predict_proba(rf_model, data):
    '''
    This wrapper overcomes the "binary" nature of predictions in the native
    RandomForestModel.
    '''

    # Collect the individual decision tree models by calling the underlying
    # Java model. These are returned as JavaArray defined by py4j.
    trees = rf_model._java_model.trees()
    ntrees = rf_model.numTrees()
    scores = DecisionTreeModel(trees[0]).predict(data.map(lambda x: x.features))

    # For each remaining decision tree, apply its prediction to the entire
    # dataset and accumulate the results using 'zip'.
    for i in range(1, ntrees):
        dtm = DecisionTreeModel(trees[i])
        scores = scores.zip(dtm.predict(data.map(lambda x: x.features)))
        scores = scores.map(lambda x: x[0] + x[1])

    # Divide the accumulated scores by the number of trees to obtain the
    # estimated probability of class 1.
    return scores.map(lambda x: x / ntrees)

def testError(lap):
    # 'lap' is an RDD of (label, prediction) pairs.
    testErr = lap.filter(lambda vp: vp[0] != vp[1]).count() / float(lap.count())
    print('Test Error = ' + str(testErr))


def testClassification(trainingData, testData):

    model = RandomForest.trainClassifier(trainingData, numClasses=2,
                                         categoricalFeaturesInfo={},
                                         numTrees=50, maxDepth=30)

    # Compute test error by thresholding probabilistic predictions
    threshold = 0.5
    scores = predict_proba(model, testData)
    pred = scores.map(lambda x: 0 if x < threshold else 1)
    lab_pred = testData.map(lambda lp: lp.label).zip(pred)
    testError(lab_pred)

    # Compute test error by comparing binary predictions
    predictions = model.predict(testData.map(lambda x: x.features))
    labelsAndPredictions = testData.map(lambda lp: lp.label).zip(predictions)
    testError(labelsAndPredictions)
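
For completeness, a minimal driver for the functions above might look as follows; it assumes an existing SparkContext sc and the sample data file from the question (both are assumptions, not spelled out in the original answer):

from pyspark.mllib.util import MLUtils

# Assumed setup: 'sc' is a live SparkContext and the sample libsvm file
# shipped with Spark is available at the path used in the question.
data = MLUtils.loadLibSVMFile(sc, 'data/mllib/sample_libsvm_data.txt')
trainingData, testData = data.randomSplit([0.7, 0.3])
testClassification(trainingData, testData)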

All-in-all, this was a nice exercise to learn Spark!
