This article looks at why multiprocessing.Pool() can end up slower than just calling an ordinary function, and how the problem was eventually solved.

Problem description

(This question is about how to make multiprocessing.Pool() run code faster. I finally solved it, and the final solution can be found at the bottom of the post.)

Original question:

I'm trying to use Python to compare a word with many other words in a list and retrieve a list of the most similar ones. To do that I am using the difflib.get_close_matches function. I'm on a relatively new and powerful Windows 7 laptop, with Python 2.6.5.

What I want is to speed up the comparison process, because my comparison list of words is very long and I have to repeat the comparison several times. When I heard about the multiprocessing module, it seemed logical that if the comparison could be broken up into worker tasks and run simultaneously (thus trading machine power for speed), my comparison task would finish faster.

However, even after having tried many different ways, and used methods that have been shown in the docs and suggested in forum posts, the Pool method just seems to be incredibly slow, much slower than just running the original get_close_matches function on the entire list at once. I would like help understanding why Pool() is so slow and whether I am using it correctly. I'm only using this string-comparison scenario as an example because it is the most recent case where I was unable to understand or get multiprocessing to work for me rather than against me. Below is example code from the difflib scenario showing the time difference between the ordinary and the pooled methods:

from multiprocessing import Pool
import random, time, difflib

# constants (kept at module level so the worker processes can see them)
wordlist = ["".join(random.choice("abcdefghijklmnopqersty") for lengthofword in xrange(5)) for nrofwords in xrange(1000000)]
mainword = "hello"

# comparison function
def findclosematch(subwordlist):
    matches = difflib.get_close_matches(mainword, subwordlist, len(subwordlist), 0.7)
    if matches != []:
        return matches

if __name__ == '__main__':

    # pool method
    print "pool method"
    pool = Pool(processes=3)
    t = time.time()
    result = pool.map_async(findclosematch, wordlist, chunksize=100)
    # do something with the results
    for r in result.get():
        pass
    print time.time()-t

    # normal method
    print "normal method"
    t = time.time()
    result = findclosematch(wordlist)
    # do something with the results
    for r in result:
        pass
    print time.time()-t

The word to be found is "hello", and the list of words in which to find close matches is a list of 1 million words, each made of 5 randomly joined characters (for illustration purposes only). I use 3 processor cores and the map function with a chunksize of 100 (the number of list items handed to each worker per task, I think??) (I also tried chunksizes of 1000 and 10 000, but there was no real difference). Notice that in both methods I start the timer right before calling my function and end it right after having looped through the results. As you can see below, the timing results are clearly in favor of the original non-Pool method:

>>>
pool method
37.1690001488 seconds
normal method
10.5329999924 seconds
>>>

The Pool method is almost 4 times slower than the original method. Is there something I am missing here, or maybe a misunderstanding about how pooling/multiprocessing works? I do suspect that part of the problem could be that the map function returns None and so adds thousands of unnecessary items to the results list, even though I only want actual matches to be returned and have written the function that way. From what I understand, that is just how map works. I have heard about other functions like filter that only collect non-False results, but I don't think multiprocessing/Pool supports a filter method. Are there any other functions besides map/imap in the multiprocessing module that could help me return only what my function returns? The apply function, as I understand it, is more for passing multiple arguments.
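
As a side note, here is a minimal sketch of one way around the None issue (this is not part of the original question; the smaller 100 000-word list and the sublist size of 10 000 are arbitrary choices for illustration): hand each task an explicit sublist so it receives an actual list of words, then simply drop the empty or falsy results afterwards, since Pool has no built-in filter.

from multiprocessing import Pool
import difflib, random

def findclosematch(subwordlist):
    # returns a (possibly empty) list of close matches for "hello"
    return difflib.get_close_matches("hello", subwordlist, len(subwordlist), 0.7)

if __name__ == '__main__':
    wordlist = ["".join(random.choice("abcdefghijklmnopqersty") for letter in xrange(5))
                for word in xrange(100000)]
    # hand each task a real sublist of 10 000 words instead of a single word
    sublists = [wordlist[i:i + 10000] for i in xrange(0, len(wordlist), 10000)]
    pool = Pool(processes=3)
    # Pool has no built-in filter, so drop the empty results after the fact
    results = [m for m in pool.map(findclosematch, sublists) if m]
    pool.close()
    pool.join()
    print results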

I know there's also the imap function, which I tried, but without any time improvement. The reason is the same reason I have had trouble understanding what's so great about the itertools module, supposedly "lightning fast": I've noticed that this is true for calling the function, but in my experience, and from what I've read, that's because calling the function doesn't actually do any calculations. So when it's time to iterate through the results to collect and analyze them (without which there would be no point in calling the function), it takes just as much, or sometimes more, time than just using the normal version of the function straight up. But I suppose that's for another post.
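
Just to illustrate the laziness being described here (a toy example of my own, not from the post): building the lazy iterator returns immediately, and the actual work only happens while it is being consumed.

import itertools, time

words = ["hello"] * 1000000

t = time.time()
lazy = itertools.imap(str.upper, words)   # Python 2's lazy map: nothing is computed yet
print "created in", time.time() - t, "seconds"

t = time.time()
for w in lazy:                            # the computation happens here, one item at a time
    pass
print "consumed in", time.time() - t, "seconds"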

Anyway, I'm excited to see if someone can nudge me in the right direction here, and I really appreciate any help. I'm more interested in understanding multiprocessing in general than in getting this particular example to work, though some example solution code would be useful to aid my understanding.

Answer:

It seems the slowdown had to do with the slow startup time of the additional processes. I couldn't get the .Pool() function to be fast enough. My final solution to make it faster was to manually split the workload list, use multiple .Process() instead of .Pool(), and return the solutions in a Queue. But I wonder if the most crucial change might have been splitting the workload in terms of the main words to look for, rather than the words to compare against, perhaps because the difflib search function is already so fast. Here is the new code, running 5 processes at the same time, which turned out about 10x faster than running the simple code (6 seconds vs 55 seconds). Very useful for fast fuzzy lookups, on top of how fast difflib already is.

from multiprocessing import Process, Queue
import difflib, random, time

# worker function: find the close matches for each main word against the full
# comparison list and push each result onto the shared queue
def f2(wordlist, mainwordlist, q):
    for mainword in mainwordlist:
        matches = difflib.get_close_matches(mainword, wordlist, len(wordlist), 0.7)
        q.put(matches)

if __name__ == '__main__':

    # constants (for 50 input words, find closest match in list of 100 000 comparison words)
    q = Queue()
    wordlist = ["".join([random.choice([letter for letter in "abcdefghijklmnopqersty"]) for lengthofword in xrange(5)]) for nrofwords in xrange(100000)]
    mainword = "hello"
    mainwordlist = [mainword for each in xrange(50)]

    # normal approach: a single process does all 50 lookups one after the other
    t = time.time()
    for mainword in mainwordlist:
        matches = difflib.get_close_matches(mainword, wordlist, len(wordlist), 0.7)
        q.put(matches)
    print time.time()-t
    # empty the queue again so the multiprocessing results can be read back cleanly later
    normalresults = [q.get() for each in xrange(len(mainwordlist))]

    # split work into 5 or 10 processes
    processes = 5
    def splitlist(inlist, chunksize):
        return [inlist[x:x+chunksize] for x in xrange(0, len(inlist), chunksize)]
    print len(mainwordlist)/processes
    mainwordlistsplitted = splitlist(mainwordlist, len(mainwordlist)/processes)
    print "list ready"

    t = time.time()
    procs = []
    for submainwordlist in mainwordlistsplitted:
        print "sub"
        p = Process(target=f2, args=(wordlist, submainwordlist, q,))
        p.daemon = True
        p.start()
        procs.append(p)
    # drain the expected number of results before joining (one per main word),
    # then join every process rather than only the last one started
    poolresults = [q.get() for each in xrange(len(mainwordlist))]
    for p in procs:
        p.join()
    print time.time()-t
    for matches in poolresults:
        print matches

Recommended answer

My best guess is inter-process communication (IPC) overhead. In the single-process case, the single process already has the word list. When delegating to various other processes, the main process needs to constantly shuttle sections of the list to the other processes.
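
One rough way to get a feel for that overhead (an illustrative measurement of my own, not from the answer; multiprocessing serializes arguments with pickle, so the cost of pickling the list is a reasonable lower bound for the work the parent has to do): time how long it takes just to serialize the 1-million-word list.

import cPickle, random, time

wordlist = ["".join(random.choice("abcdefghijklmnopqersty") for letter in xrange(5))
            for word in xrange(1000000)]

t = time.time()
payload = cPickle.dumps(wordlist, cPickle.HIGHEST_PROTOCOL)
print "pickling took", time.time() - t, "seconds for", len(payload), "bytes"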

Thus, it follows that a better approach might be to spin off n processes, each of which is responsible for loading/generating a 1/n segment of the list and checking whether the word is in that part of the list.

I'm not sure how to do that with Python's multiprocessing library, though.
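
A possible sketch of that idea (my own illustration, not code from the answer; the worker function, the per-process seeds, and the segment size of 250 000 words are invented for the example): each process generates its own 1/n segment of the word list locally, so the big list never crosses a process boundary and only the small lists of matches travel back through the queue.

from multiprocessing import Process, Queue
import difflib, random

def worker(seed, segmentsize, mainword, q):
    # each process builds its own segment instead of receiving it from the parent
    random.seed(seed)
    segment = ["".join(random.choice("abcdefghijklmnopqersty") for letter in xrange(5))
               for word in xrange(segmentsize)]
    q.put(difflib.get_close_matches(mainword, segment, len(segment), 0.7))

if __name__ == '__main__':
    q = Queue()
    nprocesses = 4
    procs = [Process(target=worker, args=(i, 250000, "hello", q)) for i in xrange(nprocesses)]
    for p in procs:
        p.start()
    # drain one result per process before joining
    matches = [m for i in xrange(nprocesses) for m in q.get()]
    for p in procs:
        p.join()
    print matches

If the real list came from a file rather than being generated, each worker would read its own slice of the file instead of building random words.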
