Problem description
I have a query generated by Django's ORM that is taking hours to run.
The report_rank table (50 million rows) is in a one-to-many relation to report_profile (100k rows). I'm trying to retrieve the latest report_rank for each report_profile.
I'm running Postgres 9.1 on an extra-large Amazon EC2 server with plenty of available RAM (2 GB of 15 GB used). Disk IO is pretty bad, of course.
I have indexes on report_rank.created as well as on all foreign key fields.
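For context, the index set described would look roughly like this (index names and the exact foreign key columns are assumptions inferred from the query below):

-- Assumed shape of the existing indexes; names are illustrative:
CREATE INDEX report_rank_created_idx      ON report_rank (created);
CREATE INDEX report_rank_site_id_idx      ON report_rank (site_id);
CREATE INDEX report_profile_site_id_idx   ON report_profile (site_id);
CREATE INDEX report_profile_client_id_idx ON report_profile (client_id);
CREATE INDEX crm_client_user_id_idx       ON crm_client (user_id);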
What can I do to speed this query up? I'd be happy to try a different approach to the query if it will be more performant, or to tune any database configuration parameters as needed.
EXPLAIN
SELECT "report_rank"."id", "report_rank"."keyword_id", "report_rank"."site_id"
, "report_rank"."rank", "report_rank"."url", "report_rank"."competition"
, "report_rank"."source", "report_rank"."country", "report_rank"."created"
, MAX(T7."created") AS "max"
FROM "report_rank"
LEFT OUTER JOIN "report_site"
ON ("report_rank"."site_id" = "report_site"."id")
INNER JOIN "report_profile"
ON ("report_site"."id" = "report_profile"."site_id")
INNER JOIN "crm_client"
ON ("report_profile"."client_id" = "crm_client"."id")
INNER JOIN "auth_user"
ON ("crm_client"."user_id" = "auth_user"."id")
LEFT OUTER JOIN "report_rank" T7
ON ("report_site"."id" = T7."site_id")
WHERE ("auth_user"."is_active" = True AND "crm_client"."is_deleted" = False )
GROUP BY "report_rank"."id", "report_rank"."keyword_id", "report_rank"."site_id"
, "report_rank"."rank", "report_rank"."url", "report_rank"."competition"
, "report_rank"."source", "report_rank"."country", "report_rank"."created"
HAVING MAX(T7."created") = "report_rank"."created";
Output of EXPLAIN:
GroupAggregate (cost=1136244292.46..1276589375.47 rows=48133327 width=72)
Filter: (max(t7.created) = report_rank.created)
-> Sort (cost=1136244292.46..1147889577.16 rows=4658113881 width=72)
Sort Key: report_rank.id, report_rank.keyword_id, report_rank.site_id, report_rank.rank, report_rank.url, report_rank.competition, report_rank.source, report_rank.country, report_rank.created
-> Hash Join (cost=1323766.36..6107863.59 rows=4658113881 width=72)
Hash Cond: (report_rank.site_id = report_site.id)
-> Seq Scan on report_rank (cost=0.00..1076119.27 rows=48133327 width=64)
-> Hash (cost=1312601.51..1312601.51 rows=893188 width=16)
-> Hash Right Join (cost=47050.38..1312601.51 rows=893188 width=16)
Hash Cond: (t7.site_id = report_site.id)
-> Seq Scan on report_rank t7 (cost=0.00..1076119.27 rows=48133327 width=12)
-> Hash (cost=46692.28..46692.28 rows=28648 width=8)
-> Nested Loop (cost=2201.98..46692.28 rows=28648 width=8)
-> Hash Join (cost=2201.98..5733.23 rows=28648 width=4)
Hash Cond: (crm_client.user_id = auth_user.id)
-> Hash Join (cost=2040.73..5006.71 rows=44606 width=8)
Hash Cond: (report_profile.client_id = crm_client.id)
-> Seq Scan on report_profile (cost=0.00..1706.09 rows=93009 width=8)
-> Hash (cost=1761.98..1761.98 rows=22300 width=8)
-> Seq Scan on crm_client (cost=0.00..1761.98 rows=22300 width=8)
Filter: (NOT is_deleted)
-> Hash (cost=126.85..126.85 rows=2752 width=4)
-> Seq Scan on auth_user (cost=0.00..126.85 rows=2752 width=4)
Filter: is_active
-> Index Scan using report_site_pkey on report_site (cost=0.00..1.42 rows=1 width=4)
Index Cond: (id = report_profile.site_id)
Recommended answer
The major point is most likely that you JOIN and GROUP over everything just to get max(created). Get this value separately.
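The idea in isolation (a minimal sketch of the principle, not the final query below): compute max(created) per site in a derived table first, then join it back, instead of grouping the fully joined row set:

-- Sketch only: fetch max(created) per site separately, then join back.
SELECT r.*
FROM   report_rank r
JOIN  (
   SELECT site_id, max(created) AS max_created
   FROM   report_rank
   GROUP  BY site_id
   ) m ON m.site_id = r.site_id
      AND m.max_created = r.created;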
You mentioned all the indexes that are needed here: on report_rank.created and on the foreign keys. You are doing alright there. (If you are interested in better than "alright", keep reading!)
The LEFT JOIN report_site will be forced to a plain JOIN by the WHERE clause. I substituted a plain JOIN. I also simplified your syntax a lot.
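To see why (a generic sketch with made-up tables a and b, not the original schema): a strict predicate on the nullable side can never be satisfied by a NULL-extended row, so the planner treats the LEFT JOIN as a plain JOIN:

-- b.flag IS NULL on NULL-extended rows, and NULL is not TRUE,
-- so this LEFT JOIN behaves exactly like an INNER JOIN:
SELECT *
FROM   a
LEFT   JOIN b ON b.a_id = a.id
WHERE  b.flag;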
Updated July 2015 with simpler, faster queries and smarter functions.
report_rank.created is not unique and you want all the latest rows. Using the window function rank() in a subquery:
SELECT sub.id, sub.keyword_id, sub.site_id
     , sub.rank, sub.url, sub.competition
     , sub.source, sub.country, sub.created  -- same as "max"
FROM  (
   SELECT *, rank() OVER (ORDER BY created DESC NULLS LAST) AS rnk
   FROM   report_rank r
   WHERE  EXISTS (
      SELECT *
      FROM   report_site    s
      JOIN   report_profile p ON p.site_id = s.id
      JOIN   crm_client     c ON c.id = p.client_id
      JOIN   auth_user      u ON u.id = c.user_id
      WHERE  s.id = r.site_id
      AND    u.is_active
      AND    c.is_deleted = FALSE
      )
-- AND    r.created > f_report_rank_cap()  -- uncomment after the setup below
   ) sub
WHERE  sub.rnk = 1;
Why DESC NULLS LAST?
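In descending sort order Postgres puts NULL values first by default, so a single row with created IS NULL would outrank every real timestamp; NULLS LAST avoids that, and an index declared the same way can serve the sort directly. A minimal sketch (the index name is illustrative):

-- Matches ORDER BY created DESC NULLS LAST, so the sort can use the index:
CREATE INDEX report_rank_created_desc_idx
ON report_rank (created DESC NULLS LAST);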
You may have noticed the commented part in the last query:
AND r.created > f_report_rank_cap()
You mentioned 50 million rows; that's a lot. Here is a way to speed things up:
- Create a simple IMMUTABLE function returning a timestamp that's guaranteed to be older than rows of interest while being as young as possible.
- Create a partial index on younger rows only, based on this function.
- Use a WHERE condition in queries that matches the index condition.
- Create another function that updates these objects to the latest row with dynamic DDL. (Minus a secure margin in case the newest row(s) get deleted / deactivated, if that can happen.)
- Invoke this secondary function at off-times with a minimum of concurrent activity, per cronjob or on demand. As often as you want; it can't do harm, it just needs a short exclusive lock on the table.
Here is a complete working demo. @erikcw, you'll have to activate the commented part as instructed below.
CREATE TABLE report_rank(created timestamp);
INSERT INTO report_rank VALUES ('2011-11-11 11:11'),(now());
-- initial function
CREATE OR REPLACE FUNCTION f_report_rank_cap()
RETURNS timestamp LANGUAGE sql COST 1 IMMUTABLE AS
$y$SELECT timestamp '-infinity'$y$; -- or as high as you can safely bet.
-- initial index; 1st run indexes whole tbl if starting with '-infinity'
CREATE INDEX report_rank_recent_idx ON report_rank (created DESC NULLS LAST)
WHERE created > f_report_rank_cap();
-- function to update function & reindex
CREATE OR REPLACE FUNCTION f_report_rank_set_cap()
  RETURNS void AS
$func$
DECLARE
   _secure_margin CONSTANT interval := interval '1 day';  -- adjust to your case
   _cap timestamp;  -- exclude older rows than this from partial index
BEGIN
   SELECT max(created) - _secure_margin
   FROM   report_rank r  -- alias "r" is referenced by the commented EXISTS
   WHERE  created > f_report_rank_cap() + _secure_margin
/* not needed for the demo; @erikcw needs to activate this
   AND    EXISTS (
      SELECT *
      FROM   report_site    s
      JOIN   report_profile p ON p.site_id = s.id
      JOIN   crm_client     c ON c.id = p.client_id
      JOIN   auth_user      u ON u.id = c.user_id
      WHERE  s.id = r.site_id
      AND    u.is_active
      AND    c.is_deleted = FALSE)
*/
   INTO   _cap;

   IF _cap IS NOT NULL THEN  -- SELECT max() always returns a row, so test the value
      -- recreate function
      EXECUTE format('
         CREATE OR REPLACE FUNCTION f_report_rank_cap()
           RETURNS timestamp LANGUAGE sql IMMUTABLE AS
         $y$SELECT %L::timestamp$y$', _cap);
      -- reindex
      REINDEX INDEX report_rank_recent_idx;
   END IF;
END
$func$ LANGUAGE plpgsql;
COMMENT ON FUNCTION f_report_rank_set_cap()
IS 'Dynamically recreate function f_report_rank_cap()
and reindex partial index on report_rank.';
Call:
SELECT f_report_rank_set_cap();
See:
SELECT f_report_rank_cap();
Uncomment the clause AND r.created > f_report_rank_cap() in the query above and observe the difference. Verify that the index gets used with EXPLAIN ANALYZE.
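For reference, the check could look like this (a sketch against the two-row demo table; the EXISTS filter from the full query is omitted for brevity):

EXPLAIN ANALYZE
SELECT *
FROM  (
   SELECT *, rank() OVER (ORDER BY created DESC NULLS LAST) AS rnk
   FROM   report_rank r
   WHERE  r.created > f_report_rank_cap()  -- the clause, now active
   ) sub
WHERE  sub.rnk = 1;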