如何在Impala中使用大量数据解决空白岛问题

本文介绍了如何在Impala中使用大量数据解决空白岛问题的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

在Impala表中具有类型2维度，该表具有〜102列的约500M行:(C1，C2，...，C8，... C100，Eff_DT，EXP_DT)只需要选择具有不同组合值(C1，C2，... C8)的行.对于每个选定的记录，EFF_DT和EXP_DT必须分别是该记录所属组的min(eff_dt)和max(eff_dt)(此处的组由不同的组合(C1，C2，..，C8定义)

Have a Type 2 Dimension residing in an Impala table with ~500M rows having 102 columns : ( C1, C2, ..., C8,...C100, Eff_DT, EXP_DT)Need to select only the rows that have distinct combination value of (C1,C2,..,C8).For each selected record, the EFF_DT and EXP_DT must be respectively the min(eff_dt) and max(eff_dt) of the group to which that record belongs ( a group here is defined by a distinct combination (C1,C2,..,C8)

简单的分组依据不会解决这里的问题，因为它将忽略同一组的时间延迟...

A simple Group By will not solve the problem here because it will omit the time lags for the same group ...

为简单起见，这是必需的，也是我尝试过的-假设只有2列定义了一个组(而不是8)这是一个输入，所需输出和仅使用group by ...

For the sake of simplicity, here is what is required and what i have tried - assuming that only 2 columns are defining a group (not 8 )Here is an example of input, desired output and output by using only group by ...

--INPUT                              --DESIRED OUTPUT                     --OUTPUT of SIMPLE GROUP BY
------------------------------------------------------------------------------------------------------------
C1  C2  EFF_DT      EXP_DT           C1   C2  Eff_dt      EXP_DT          C1   C2  EFF_DT       EXP_DT
4   8   2013-11-30  2014-01-22       4    8   2013-11-30  2014-01-22      4    8   2013-11-30   2999-12-31
2   8   2014-01-23  2014-01-23       2    8   2014-01-23  2014-01-23      2    8   2014-01-23   2014-01-23
4   8   2014-01-24  2015-12-31       4    8   2014-01-24  2999-12-31
4   8   2016-01-01  2016-12-31
4   8   2017-01-01  2018-03-15
4   8   2018-03-16  2018-07-24
4   8   2018-07-25  2999-12-31

试图在select语句中使用子查询来基于当前行选择max(exp_dt)，但由于impala不支持而无法工作...

Tried to use a subquery inside select statement to select max(exp_dt) based on current row but didnt work as impala does not support that .....

这是我尝试过的查询，它运行正常，但在Impala中却无法运行(因为在select语句中不支持子查询

Here is the query i tried , which is working fine but not in Impala (Because subqueries are not supported inside select statements

SELECT
     T0.C1,
     T0.C2,
     MIN(T0.EFF_DT) AS MIN_EFF_DT,
     T0.EXP_DT
FROM (
    SELECT
    T1.C1,
    T1.C2,
    T1.EFF_DT,
    (
        SELECT MAX(T2.EXP_DT)
        FROM (select * from TABLE_NAME ) T2
        WHERE T2.C1 = T1.C1
        AND   T2.C2 = T1.C2
        AND NOT EXISTS (
        SELECT 1 FROM (select * from TABLE_NAME) T3
            WHERE T3.EXP_DT < T2.EXP_DT
            AND   T3.EXP_DT > T1.EXP_DT
            AND  (T3.C1 <> T2.C1 OR T3.C2 <> T2.C2 )
        )

    ) EXP_DT
    FROM (select * from TABLE_NAME) T1
) T0
GROUP BY
T0.C1,
T0.C2,
T0.EXP_DT
ORDER BY MIN_EFF_DT ASC

the

如何在Impala中使用大量数据解决空白岛问题

问题描述

推荐答案