问题描述
在Impala表中具有类型2维度,该表具有〜102列的约500M行:(C1,C2,...,C8,... C100,Eff_DT,EXP_DT)只需要选择具有不同组合值(C1,C2,... C8)的行.对于每个选定的记录,EFF_DT和EXP_DT必须分别是该记录所属组的min(eff_dt)和max(eff_dt)(此处的组由不同的组合(C1,C2,..,C8定义)
Have a Type 2 Dimension residing in an Impala table with ~500M rows having 102 columns : ( C1, C2, ..., C8,...C100, Eff_DT, EXP_DT)Need to select only the rows that have distinct combination value of (C1,C2,..,C8).For each selected record, the EFF_DT and EXP_DT must be respectively the min(eff_dt) and max(eff_dt) of the group to which that record belongs ( a group here is defined by a distinct combination (C1,C2,..,C8)
简单的分组依据不会解决这里的问题,因为它将忽略同一组的时间延迟...
A simple Group By will not solve the problem here because it will omit the time lags for the same group ...
为简单起见,这是必需的,也是我尝试过的-假设只有2列定义了一个组(而不是8)这是一个输入,所需输出和仅使用group by ...
For the sake of simplicity, here is what is required and what i have tried - assuming that only 2 columns are defining a group (not 8 )Here is an example of input, desired output and output by using only group by ...
--INPUT --DESIRED OUTPUT --OUTPUT of SIMPLE GROUP BY
------------------------------------------------------------------------------------------------------------
C1 C2 EFF_DT EXP_DT C1 C2 Eff_dt EXP_DT C1 C2 EFF_DT EXP_DT
4 8 2013-11-30 2014-01-22 4 8 2013-11-30 2014-01-22 4 8 2013-11-30 2999-12-31
2 8 2014-01-23 2014-01-23 2 8 2014-01-23 2014-01-23 2 8 2014-01-23 2014-01-23
4 8 2014-01-24 2015-12-31 4 8 2014-01-24 2999-12-31
4 8 2016-01-01 2016-12-31
4 8 2017-01-01 2018-03-15
4 8 2018-03-16 2018-07-24
4 8 2018-07-25 2999-12-31
试图在select语句中使用子查询来基于当前行选择max(exp_dt),但由于impala不支持而无法工作...
Tried to use a subquery inside select statement to select max(exp_dt) based on current row but didnt work as impala does not support that .....
这是我尝试过的查询,它运行正常,但在Impala中却无法运行(因为在select语句中不支持子查询
Here is the query i tried , which is working fine but not in Impala (Because subqueries are not supported inside select statements
SELECT
T0.C1,
T0.C2,
MIN(T0.EFF_DT) AS MIN_EFF_DT,
T0.EXP_DT
FROM (
SELECT
T1.C1,
T1.C2,
T1.EFF_DT,
(
SELECT MAX(T2.EXP_DT)
FROM (select * from TABLE_NAME ) T2
WHERE T2.C1 = T1.C1
AND T2.C2 = T1.C2
AND NOT EXISTS (
SELECT 1 FROM (select * from TABLE_NAME) T3
WHERE T3.EXP_DT < T2.EXP_DT
AND T3.EXP_DT > T1.EXP_DT
AND (T3.C1 <> T2.C1 OR T3.C2 <> T2.C2 )
)
) EXP_DT
FROM (select * from TABLE_NAME) T1
) T0
GROUP BY
T0.C1,
T0.C2,
T0.EXP_DT
ORDER BY MIN_EFF_DT ASC
推荐答案
很可能,对id
列进行修改后,以前的解决方案将起作用:
In all likelihood, the previous solution will work when modified for the id
column:
select id, c1, c2, min(eff_dt), max(exp_dt)
from (select t.*,
row_number() over (partition by id order by eff_dt) as seqnum,
row_number() over (partition by id, c1, c2 order by eff_dt) as seqnum_1
from t
) t
group by id, c1, c2, (seqnum - seqnum_1);
您应该能够像现在那样扩展列数.
You should be able to expand the number of columns as you with.
这篇关于如何在Impala中使用大量数据解决空白岛问题的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!