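I am using a de-normalized structure in Redshift. The plan is to keep creating records and, on retrieval, to consider only the most recent attributes for each user.
This is the table: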

user_id   state  created_at
1         A      15-10-2015 02:00:00 AM
2         A      15-10-2015 02:00:01 AM
3         A      15-10-2015 02:00:02 AM
1         B      15-10-2015 02:00:03 AM
4         A      15-10-2015 02:00:04 AM
5         B      15-10-2015 02:00:05 AM

The required result set is:
user_id   state  created_at
2         A      15-10-2015 02:00:01 AM
3         A      15-10-2015 02:00:02 AM
4         A      15-10-2015 02:00:04 AM

I have a query that retrieves the result above:
select user_id, first_value AS state
from (
   select user_id, first_value(state) OVER (
                     PARTITION BY user_id
                     ORDER BY created_at desc
                     ROWS between UNBOUNDED PRECEDING and CURRENT ROW)
   from customer_properties
   order by created_at) t
where first_value = 'A'

Is this the best way to retrieve the records, or can the query be improved?

Best answer

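The best query depends on various details: the selectivity of the query predicate, cardinalities, and data distribution. If state = 'A' is a selective condition (few rows qualify), this query should be substantially faster: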

SELECT c.user_id, c.state
FROM   customer_properties c
LEFT   JOIN customer_properties c1 ON c1.user_id = c.user_id
                                  AND c1.created_at > c.created_at
WHERE  c.state = 'A'
AND    c1.user_id IS NULL;

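That holds provided there is an index on (state) (or even on (state, user_id, created_at)) and another one on (user_id, created_at).
There are various ways to make sure that a later version of the row does not exist:
Select rows which are not present in other table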
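As a rough illustration of two of those ways, the sketch below runs the LEFT JOIN / IS NULL query from above next to a NOT EXISTS variant on the sample rows from the question. It uses Python's sqlite3 module purely as a stand-in for Redshift, with timestamps rewritten in ISO format and an added ORDER BY for stable output; those choices, the NOT EXISTS wording and the helper name run are assumptions of this sketch, not part of the original answer, and it says nothing about relative performance on Redshift.

```python
import sqlite3

# Sample rows from the question. Timestamps are rewritten in ISO format so that
# plain string comparison orders them correctly in SQLite (assumption of this sketch).
ROWS = [
    (1, "A", "2015-10-15 02:00:00"),
    (2, "A", "2015-10-15 02:00:01"),
    (3, "A", "2015-10-15 02:00:02"),
    (1, "B", "2015-10-15 02:00:03"),
    (4, "A", "2015-10-15 02:00:04"),
    (5, "B", "2015-10-15 02:00:05"),
]

# Anti-join via LEFT JOIN: keep 'A' rows for which no later row of the same user exists.
LEFT_JOIN_SQL = """
SELECT c.user_id, c.state
FROM   customer_properties c
LEFT   JOIN customer_properties c1 ON c1.user_id = c.user_id
                                  AND c1.created_at > c.created_at
WHERE  c.state = 'A'
AND    c1.user_id IS NULL
ORDER  BY c.user_id
"""

# The same condition expressed with NOT EXISTS.
NOT_EXISTS_SQL = """
SELECT c.user_id, c.state
FROM   customer_properties c
WHERE  c.state = 'A'
AND    NOT EXISTS (
   SELECT 1
   FROM   customer_properties c1
   WHERE  c1.user_id = c.user_id
   AND    c1.created_at > c.created_at
   )
ORDER  BY c.user_id
"""


def run(sql):
    """Load the sample rows into a fresh in-memory table and run one query."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE customer_properties ("
        "user_id INTEGER NOT NULL, state TEXT NOT NULL, created_at TEXT NOT NULL)"
    )
    conn.executemany("INSERT INTO customer_properties VALUES (?, ?, ?)", ROWS)
    result = conn.execute(sql).fetchall()
    conn.close()
    return result


left_join_result = run(LEFT_JOIN_SQL)
not_exists_result = run(NOT_EXISTS_SQL)
print(left_join_result)
print(not_exists_result)
```

Both forms return users 2, 3 and 4 on this data; user 1 is dropped because a later row with state B exists.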
If 'A' is a common value in state, this more generic query will be faster:
SELECT user_id, state
FROM (
   SELECT user_id, state
        , row_number() OVER (PARTITION BY user_id ORDER BY created_at DESC) AS rn
   FROM   customer_properties
   ) t
WHERE  t.rn = 1
AND    t.state = 'A';
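To check the logic of the row_number() query, here is a minimal sketch that runs it on the question's sample rows. SQLite (version 3.25 or later, for window function support) stands in for Redshift here; the ISO timestamps, the trailing ORDER BY and the helper name latest_state_a are assumptions made for this illustration only.

```python
import sqlite3

# Sample rows from the question, with ISO timestamps so that string ordering
# matches chronological ordering in SQLite (assumption of this sketch).
ROWS = [
    (1, "A", "2015-10-15 02:00:00"),
    (2, "A", "2015-10-15 02:00:01"),
    (3, "A", "2015-10-15 02:00:02"),
    (1, "B", "2015-10-15 02:00:03"),
    (4, "A", "2015-10-15 02:00:04"),
    (5, "B", "2015-10-15 02:00:05"),
]

# Number each user's rows from newest to oldest, keep only the newest (rn = 1),
# then filter on its state. Requires SQLite >= 3.25 for window functions.
LATEST_A_SQL = """
SELECT user_id, state
FROM (
   SELECT user_id, state
        , row_number() OVER (PARTITION BY user_id ORDER BY created_at DESC) AS rn
   FROM   customer_properties
   ) t
WHERE  t.rn = 1
AND    t.state = 'A'
ORDER  BY user_id
"""


def latest_state_a(rows):
    """Return (user_id, state) for users whose most recent row has state 'A'."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE customer_properties ("
        "user_id INTEGER NOT NULL, state TEXT NOT NULL, created_at TEXT NOT NULL)"
    )
    conn.executemany("INSERT INTO customer_properties VALUES (?, ?, ?)", rows)
    result = conn.execute(LATEST_A_SQL).fetchall()
    conn.close()
    return result


print(latest_state_a(ROWS))
```

On the sample data this yields users 2, 3 and 4, matching the required result set. The state filter is applied only after rn = 1 is fixed, so a user whose latest row is B is excluded even if an older row was A.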

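I removed NULLS LAST, assuming that created_at is defined NOT NULL. Also, I don't think Redshift has it:
PostgreSQL sort by datetime asc, null first?
Both queries should work with the limited functionality of Redshift. With modern Postgres, there are better options: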
Select first row in each GROUP BY group?
Optimize GROUP BY query to retrieve latest record per user
Your original query returns all rows per user_id if the latest row matches. You would then have to fold the duplicates, which is needless work.
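The duplicate effect does not show up in the sample data, because the only user with two rows ends in state B. The sketch below adds one hypothetical extra row (a second 'A' row for user 2) and runs a version of the original first_value() query against it. SQLite 3.25 or later stands in for Redshift; the extra row, the explicit latest_state alias (SQLite does not name the column first_value on its own), the ISO timestamps and the ORDER BY are all assumptions of this illustration.

```python
import sqlite3

# Sample rows from the question plus ONE hypothetical extra row (the last one),
# added only to make the duplicate effect visible.
ROWS = [
    (1, "A", "2015-10-15 02:00:00"),
    (2, "A", "2015-10-15 02:00:01"),
    (3, "A", "2015-10-15 02:00:02"),
    (1, "B", "2015-10-15 02:00:03"),
    (4, "A", "2015-10-15 02:00:04"),
    (5, "B", "2015-10-15 02:00:05"),
    (2, "A", "2015-10-15 02:00:06"),  # hypothetical: user 2 gets a second 'A' row
]

# Shape of the original query: first_value() tags EVERY row of a user with the
# user's latest state, so every row of a matching user passes the filter.
ORIGINAL_SQL = """
SELECT user_id, latest_state AS state
FROM (
   SELECT user_id
        , first_value(state) OVER (PARTITION BY user_id
                                   ORDER BY created_at DESC
                                   ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS latest_state
   FROM   customer_properties
   ) t
WHERE  latest_state = 'A'
ORDER  BY user_id
"""

# row_number() keeps exactly one row per user before the state filter is applied.
ROW_NUMBER_SQL = """
SELECT user_id, state
FROM (
   SELECT user_id, state
        , row_number() OVER (PARTITION BY user_id ORDER BY created_at DESC) AS rn
   FROM   customer_properties
   ) t
WHERE  t.rn = 1
AND    t.state = 'A'
ORDER  BY user_id
"""


def run(sql):
    """Load ROWS into a fresh in-memory table and run one query."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE customer_properties ("
        "user_id INTEGER NOT NULL, state TEXT NOT NULL, created_at TEXT NOT NULL)"
    )
    conn.executemany("INSERT INTO customer_properties VALUES (?, ?, ?)", ROWS)
    result = conn.execute(sql).fetchall()
    conn.close()
    return result


original_result = run(ORIGINAL_SQL)
row_number_result = run(ROW_NUMBER_SQL)
print(original_result)    # user 2 appears once per stored row
print(row_number_result)  # one row per user
```

With the extra row, the first_value() form lists user 2 twice while the row_number() form lists each user once, which is the folding work the answer refers to.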

Regarding sql - Retrieve records based on latest state/attribute value, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/33135937/
