从大桌子获取每个父母的最新孩子 - 查询太慢

本文介绍了从大桌子获取每个父母的最新孩子 - 查询太慢的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

我有一个由Django的ORM生成的查询，需要几个小时才能运行。

I have a query generated by Django's ORM, that is taking hours to run.

report_rank （5000万行）与 report_profile （100k行）有一对多的关系。我正在为每个 report_profile 检索最新的 report_rank 。

The report_rank table (50 million rows) is in a one to many relation to report_profile (100k rows). I'm trying to retrieve the latest report_rank for each report_profile.

我在一个超大型的Amazon EC2服务器上运行Postgres 9.1，并且拥有大量的可用RAM（使用2GB / 15GB）。磁盘IO当然是非常糟糕的。

I'm running Postgres 9.1 on an extra large Amazon EC2 server with plenty of available RAM (2GB/15GB used). Disk IO is pretty bad of course.

我在 report_rank.created 以及所有外键上都有索引字段。

I have indexes on report_rank.created as well as on all foreign key fields.

如何加快查询速度？我乐意尝试使用不同的方法查询，如果它将是性能的，或调整所需的任何数据库配置参数。

What can I do to speed this query up? I'd be happy to try a different approach with the query if it will be performant, or to tune any database configuration parameters needed.

EXPLAIN 
SELECT "report_rank"."id", "report_rank"."keyword_id", "report_rank"."site_id"
     , "report_rank"."rank", "report_rank"."url", "report_rank"."competition"
     , "report_rank"."source", "report_rank"."country", "report_rank"."created"
     , MAX(T7."created") AS "max" 
FROM "report_rank" 
LEFT OUTER JOIN "report_site" 
  ON ("report_rank"."site_id" = "report_site"."id") 
INNER JOIN "report_profile" 
  ON ("report_site"."id" = "report_profile"."site_id") 
INNER JOIN "crm_client" 
  ON ("report_profile"."client_id" = "crm_client"."id") 
INNER JOIN "auth_user" 
  ON ("crm_client"."user_id" = "auth_user"."id") 
LEFT OUTER JOIN "report_rank" T7 
  ON ("report_site"."id" = T7."site_id") 
WHERE ("auth_user"."is_active" = True  AND "crm_client"."is_deleted" = False ) 
GROUP BY "report_rank"."id", "report_rank"."keyword_id", "report_rank"."site_id"
     , "report_rank"."rank", "report_rank"."url", "report_rank"."competition"
     , "report_rank"."source", "report_rank"."country", "report_rank"."created" 
HAVING MAX(T7."created") =  "report_rank"."created";

EXPLAIN 的输出：

GroupAggregate  (cost=1136244292.46..1276589375.47 rows=48133327 width=72)
  Filter: (max(t7.created) = report_rank.created)
  ->  Sort  (cost=1136244292.46..1147889577.16 rows=4658113881 width=72)
        Sort Key: report_rank.id, report_rank.keyword_id, report_rank.site_id, report_rank.rank, report_rank.url, report_rank.competition, report_rank.source, report_rank.country, report_rank.created
        ->  Hash Join  (cost=1323766.36..6107863.59 rows=4658113881 width=72)
              Hash Cond: (report_rank.site_id = report_site.id)
              ->  Seq Scan on report_rank  (cost=0.00..1076119.27 rows=48133327 width=64)
              ->  Hash  (cost=1312601.51..1312601.51 rows=893188 width=16)
                    ->  Hash Right Join  (cost=47050.38..1312601.51 rows=893188 width=16)
                          Hash Cond: (t7.site_id = report_site.id)
                          ->  Seq Scan on report_rank t7  (cost=0.00..1076119.27 rows=48133327 width=12)
                          ->  Hash  (cost=46692.28..46692.28 rows=28648 width=8)
                                ->  Nested Loop  (cost=2201.98..46692.28 rows=28648 width=8)
                                      ->  Hash Join  (cost=2201.98..5733.23 rows=28648 width=4)
                                            Hash Cond: (crm_client.user_id = auth_user.id)
                                            ->  Hash Join  (cost=2040.73..5006.71 rows=44606 width=8)
                                                  Hash Cond: (report_profile.client_id = crm_client.id)
                                                  ->  Seq Scan on report_profile  (cost=0.00..1706.09 rows=93009 width=8)
                                                  ->  Hash  (cost=1761.98..1761.98 rows=22300 width=8)
                                                        ->  Seq Scan on crm_client  (cost=0.00..1761.98 rows=22300 width=8)
                                                              Filter: (NOT is_deleted)
                                            ->  Hash  (cost=126.85..126.85 rows=2752 width=4)
                                                  ->  Seq Scan on auth_user  (cost=0.00..126.85 rows=2752 width=4)
                                                        Filter: is_active
                                      ->  Index Scan using report_site_pkey on report_site  (cost=0.00..1.42 rows=1 width=4)
                                            Index Cond: (id = report_profile.site_id)

推荐答案

主要的一点很可能是你 JOIN 和 GROUP 超过一切只是为了获得 max（created）。分别获取此值。

The major point is most likely that you JOIN and GROUP over everything just to get max(created). Get this value separately.

您提到了所需的所有索引： report_rank.created 和外键。你在那里做得很好（如果您对好有兴趣，请继续阅读！）

You mentioned all the indexes that are needed here: on report_rank.created and on the foreign keys. You are doing alright there. (If you are interested in better than "alright", keep reading!)

LEFT JOIN report_site 将被 WHERE 子句强制为一个简单的 JOIN 。我替换了一个简单的 JOIN 。我也简化了你的语法。

The LEFT JOIN report_site will be forced to a plain JOIN by the WHERE clause. I substituted a plain JOIN. I also simplified your syntax a lot.

2015年7月更新更简单，更快速的查询和更智能的功能。

Updated July 2015 with simpler, faster queries and smarter functions.

report_rank.created 不唯一并且您想要全部最新的行。

使用窗口函数在子查询中。

report_rank.created is not unique and you want all the latest rows.
Using the window function rank() in a subquery.

SELECT r.id, r.keyword_id, r.site_id
     , r.rank, r.url, r.competition
     , r.source, r.country, r.created  -- same as "max"
FROM  (
   SELECT *, rank() OVER (ORDER BY created DESC NULLS LAST) AS rnk
   FROM   report_rank r
   WHERE  EXISTS (
      SELECT *
      FROM   report_site    s
      JOIN   report_profile p ON p.site_id = s.id
      JOIN   crm_client     c ON c.id      = p.client_id
      JOIN   auth_user      u ON u.id      = c.user_id
      WHERE  s.id = r.site_id
      AND    u.is_active
      AND    c.is_deleted = FALSE
      )
   ) sub
WHERE  rnk = 1;

为什么 DESC NULLS LAST ？

您可能已经注意到最后一个查询中已注释的部分：

You may have noticed the commented part in the last query:

AND  r.created > f_report_rank_cap()

你提到了50 mio。行，这很多。这是一种加快速度的方法：

You mentioned 50 mio. rows, that's a lot. Here is a way to speed things up:

创建一个简单的 IMMUTABLE 函数返回一个时间戳保证比年龄更大的兴趣同时年龄越小越好。

创建仅在较小的行上 - 基于此功能。

在与索引条件匹配的查询中使用 WHERE 条件。 / li>

创建另一个使用动态DDL将这些对象更新到最新行的函数。（减去安全保证金，以防最新行被删除/停用 - 如果可以发生）

在离线时调用此次要功能每个cronjob或按需要的最小并发活动。经常按照你想要的，不能伤害它，只需要一个简单的独家锁在桌子上。

Create a simple IMMUTABLE function returning a timestamp that's guaranteed to be older than rows of interest while being as young as possible.
Create a partial index on younger rows only - based on this function.
Use a WHERE condition in queries that matches the index condition.
Create another function that updates these objects to the latest row with dynamic DDL. (Minus a secure margin in case the newest row(s) get deleted / deactivated - if that can happen)
Invoke this secondary function at off-times with a minimum of concurrent activity per cronjob or on demand. As often as you want, can't do harm, it just needs a short exclusive lock on the table.

这是一个完成工作演示。

@erikcw，您必须按照下面的说明激活已注释的部分。

Here is a complete working demo.
@erikcw, you'll have to activate the commented part as instructed below.

CREATE TABLE report_rank(created timestamp);
INSERT INTO report_rank VALUES ('2011-11-11 11:11'),(now());

-- initial function
CREATE OR REPLACE FUNCTION f_report_rank_cap()
  RETURNS timestamp LANGUAGE sql COST 1 IMMUTABLE AS
$y$SELECT timestamp '-infinity'$y$;  -- or as high as you can safely bet.

-- initial index; 1st run indexes whole tbl if starting with '-infinity'
CREATE INDEX report_rank_recent_idx ON report_rank (created DESC NULLS LAST)
WHERE  created > f_report_rank_cap();

-- function to update function & reindex
CREATE OR REPLACE FUNCTION f_report_rank_set_cap()
  RETURNS void AS
$func$
DECLARE
   _secure_margin CONSTANT interval := interval '1 day';  -- adjust to your case
   _cap timestamp;  -- exclude older rows than this from partial index
BEGIN
   SELECT max(created) - _secure_margin
   FROM   report_rank
   WHERE  created > f_report_rank_cap() + _secure_margin
   /*  not needed for the demo; @erikcw needs to activate this
   AND    EXISTS (
     SELECT *
     FROM   report_site    s
     JOIN   report_profile p ON p.site_id = s.id
     JOIN   crm_client     c ON c.id      = p.client_id
     JOIN   auth_user      u ON u.id      = c.user_id
     WHERE  s.id = r.site_id
     AND    u.is_active
     AND    c.is_deleted = FALSE)
   */
   INTO   _cap;

   IF FOUND THEN
     -- recreate function
     EXECUTE format('
     CREATE OR REPLACE FUNCTION f_report_rank_cap()
       RETURNS timestamp LANGUAGE sql IMMUTABLE AS
     $y$SELECT %L::timestamp$y$', _cap);

     -- reindex
     REINDEX INDEX report_rank_recent_idx;
   END IF;
END
$func$  LANGUAGE plpgsql;

COMMENT ON FUNCTION f_report_rank_set_cap()
IS 'Dynamically recreate function f_report_rank_cap()
    and reindex partial index on report_rank.';

致电：

SELECT f_report_rank_set_cap();

请参阅：

SELECT f_report_rank_cap();

取消注释条款 AND r.created> f_report_rank_cap（）在上面的查询中观察差异。验证索引是否与 EXPLAIN ANALYZE 一起使用。

Uncomment the clause AND r.created > f_report_rank_cap() in the query above and observe the difference. Verify that the index gets used with EXPLAIN ANALYZE.

：

这篇关于从大桌子获取每个父母的最新孩子 - 查询太慢的文章就介绍到这了，希望我们推荐的答案对大家有所帮助，也希望大家多多支持！