Faster AUC in sklearn or Python

Problem Description

I have over half a million pairs of true labels and predicted scores (each 1-d array varies in length, between roughly 10,000 and 30,000) that I need to calculate the AUC for. Right now, I have a for-loop that calls:

# Simple Example with two pairs of true/predicted values instead of 500,000
from sklearn import metrics
import numpy as np

pred = [None] * 2
pred[0] = np.array([3,2,1])
pred[1] = np.array([15,12,14,11,13])

true = [None] * 2
true[0] = np.array([1,0,0])
true[1] = np.array([1,1,1,0,0])

for i in range(2):
    fpr, tpr, thresholds = metrics.roc_curve(true[i], pred[i])
    print(metrics.auc(fpr, tpr))

However, it takes about 1-1.5 hours to process the entire dataset and calculate the AUC for each true/prediction pair. Is there a faster/better way to do this?

Update

Each of the 500k entries can have shape (1, 10k+). I understand that I could parallelize it, but I'm stuck on a machine with only two processors, so my runtime can really only be cut down to, say, 30-45 minutes, which is still too long. I've identified that the AUC calculation itself is slow and was hoping to find a faster AUC algorithm than what is available in sklearn, or, at least, a better way to vectorize the AUC calculation so that it can be broadcast across multiple rows.
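
For reference, with binary labels the ROC AUC equals the normalized Mann-Whitney U statistic, which needs only one sort per pair instead of building the full ROC curve. A minimal sketch of that idea (fast_auc is an illustrative name, not a sklearn function; it assumes 0/1 labels with both classes present):

import numpy as np
from scipy.stats import rankdata

def fast_auc(y_true, y_score):
    """Rank-based AUC (Mann-Whitney U); average ranks handle ties."""
    y_true = np.asarray(y_true)
    ranks = rankdata(y_score)          # one O(n log n) sort per pair
    n_pos = y_true.sum()
    n_neg = len(y_true) - n_pos
    # Sum of positive ranks, shifted by its minimum possible value and
    # normalized by the number of positive/negative pairs.
    return (ranks[y_true == 1].sum() - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)

On the toy data above this matches metrics.auc, returning 1.0 and 0.8333... for the two pairs.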

Answer

Since the calculation of each true/pred pair is independent (if I understood your setup), you should be able to reduce total processing time by using multiprocessing, effectively parallelizing the calculations:

import multiprocessing as mp
from sklearn import metrics  # needed if this runs standalone, not after the example above

def roc(v):
    """ calculate one pair, return (index, auc) """
    i, true, pred = v
    fpr, tpr, thresholds = metrics.roc_curve(true, pred, drop_intermediate=True)
    auc = metrics.auc(fpr, tpr)
    return i, auc

pool = mp.Pool(3) 
result = pool.map_async(roc, ((i, true[i], pred[i]) for i in range(2)))
pool.close()
pool.join()
print(result.get())
=>
[(0, 1.0), (1, 0.83333333333333326)]

Here Pool(3) creates a pool of 3 processes, and .map_async maps over all true/pred pairs, calling the roc function with one pair at a time. The index is passed along so that results can be mapped back to their pairs.
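
Since map_async returns its results in input order, the (index, auc) tuples can also be turned straight into a lookup table; for example:

# Build an index -> AUC mapping from the returned tuples:
aucs = dict(result.get())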

If the true/pred pairs are too large to serialize and send to the processes, you might need to write the data to some external data structure before calling roc, pass roc just the index i, and have it read the data for each pair true[i]/pred[i] before processing.
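
One possible way to do that is to dump each pair to disk once up front and load it by index inside the worker; a sketch under that assumption (the file naming scheme is purely illustrative):

import multiprocessing as mp
import numpy as np
from sklearn import metrics

# Write each pair once, before creating the pool:
for i in range(len(true)):
    np.save('true_%d.npy' % i, true[i])
    np.save('pred_%d.npy' % i, pred[i])

def roc_from_disk(i):
    """Load one pair by index, compute its AUC."""
    t = np.load('true_%d.npy' % i)
    p = np.load('pred_%d.npy' % i)
    fpr, tpr, thresholds = metrics.roc_curve(t, p)
    return i, metrics.auc(fpr, tpr)

pool = mp.Pool(3)
result = pool.map_async(roc_from_disk, range(len(true)))
pool.close()
pool.join()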

A Pool automatically manages the scheduling of its processes. To reduce the risk of excessive memory use, you might want to pass the maxtasksperchild=1 parameter to Pool(...), which starts a new process for each true/pred pair (choose another number as you see fit).
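
For instance, as a one-line variation on the pool created above:

pool = mp.Pool(3, maxtasksperchild=1)  # recycle each worker process after every task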

Update

Naturally, the two-processor limit is a limiting factor. However, considering the availability of cloud computing resources at very reasonable cost, where you only pay for the time you actually need, you might want to consider hardware alternatives before spending many hours optimizing a calculation that can be parallelized so effectively. That's a luxury in its own right, really.

