This article looks at how to tune PostgreSQL performance and memory usage in a Python workflow; it may be a useful reference for anyone running into a similar problem.

Problem Description

I use PostgreSQL 9.4 for a model database. My table looks somewhat like this:

CREATE TABLE table1 (
    sid INTEGER PRIMARY KEY NOT NULL DEFAULT nextval('table1_sid_seq'::regclass),
    col1 INT,
    col2 INT,
    col3 JSONB);

My Python 2.7 workflow often looks like this:

curs.execute("SELECT sid, col1, col2 FROM table1")
data = curs.fetchall()
putback = []
for i in data: 
    result = do_something(i[1], i[2])
    putback.append((sid, result))
del data
curs.execute("UPDATE table1
              SET col3 = p.result
              FROM unnest(%s) p(sid INT, result JSONB)
              WHERE sid = p.sid", (putback,))

This typically works quite well and efficiently. However, for large queries PostgreSQL memory use will sometimes go through the roof (>50 GB) during the UPDATE command, and I believe it is then being killed by OS X, because I get WARNING: terminating connection because of crash of another server process. My MacBook Pro has 16 GB of RAM, and the query in question has 11M rows, each with about 100 characters of data to write back.

My postgresql.conf:

default_statistics_target = 50
maintenance_work_mem = 512MB 
constraint_exclusion = on 
checkpoint_completion_target = 0.9
effective_cache_size = 4GB 
work_mem = 256MB 
wal_buffers = 16MB 
checkpoint_segments = 128 
shared_buffers = 1024MB 
max_connections = 80
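
To double-check which of these values the running server actually picked up, the settings can be read back through the pg_settings system view. This is only a minimal sketch of my own, reusing the curs cursor from the workflow above:

curs.execute("""SELECT name, setting, unit
                FROM pg_settings
                WHERE name IN ('work_mem', 'maintenance_work_mem',
                               'shared_buffers', 'effective_cache_size')""")
for name, setting, unit in curs.fetchall():
    # e.g. work_mem 262144 kB for a 256MB work_mem
    print(name, setting, unit)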

So I am wondering:

  1. Why does my query sometimes consume that much RAM?
  2. How can I keep memory usage under control and still get good performance?
  3. Is there a good guide or tool for tuning PostgreSQL?

Update:
I am pretty sure that @wildplasser pinpointed my problem. In the comments he suggests dumping the data into the database first, and unpacking it from there. Unfortunately I could not figure out how to implement his proposal. If anyone has an idea how to do that, their answer will be gladly accepted.
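
For reference, here is one guess at what that could look like: stage the results in a temporary table first (using psycopg2's execute_values to send them to the server in pages), then run a single set-based UPDATE from that table. This is only a sketch of my reading of the suggestion, not @wildplasser's own code; the staging table name and page size are arbitrary, and result is assumed to be a JSON string (or a psycopg2.extras.Json wrapper).

from psycopg2.extras import execute_values

# Stage the write-back data in a temporary table.
curs.execute("CREATE TEMP TABLE staging (sid INT, result JSONB)")
execute_values(curs,
               "INSERT INTO staging (sid, result) VALUES %s",
               putback, page_size=10000)   # rows are sent in pages, not all at once
# Single set-based update from the staging table.
curs.execute("""UPDATE table1
                SET col3 = staging.result
                FROM staging
                WHERE table1.sid = staging.sid""")
curs.execute("DROP TABLE staging")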

Recommended Answer

My workaround is to chunk putback with the suggested helper function:

def chunk(l, n):
    # Split list l into consecutive slices of at most n elements each.
    n = max(1, n)
    return [l[i:i + n] for i in range(0, len(l), n)]

and then

for batch in chunk(putback, 250000):
    curs.execute("""UPDATE table1
                    SET col3 = p.result
                    FROM unnest(%s) p(sid INT, result JSONB)
                    WHERE sid = p.sid""", (batch,))

This works, i.e. it keeps the memory footprint in check, but it is not very elegant and it is slower than dumping all the data at once, as I usually do.
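
A related option (my own addition, not part of the workaround above) is to let psycopg2.extras.execute_values do the batching: it interpolates the rows into the VALUES %s placeholder and runs the statement once per page_size rows, which gives the same chunked behaviour without the explicit helper function. Whether it ends up faster than the loop above would have to be measured.

from psycopg2.extras import execute_values

# One UPDATE per page of 250000 rows; execute_values fills in the VALUES %s placeholder.
execute_values(curs,
               """UPDATE table1
                  SET col3 = data.result::jsonb
                  FROM (VALUES %s) AS data (sid, result)
                  WHERE table1.sid = data.sid""",
               putback, page_size=250000)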

That concludes this look at tuning PostgreSQL performance and memory usage in a Python workflow; hopefully the recommended answer above is helpful.
