本文介绍了分组集大小不能大于64:Hive的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我的配置单元表中有70列,我想获取所有具有完全匹配的70列的所有行.如果两行在所有列中都包含相同的数据,那么我需要找到该行并计为"2".我在下面写查询.

I have 70 columns in my hive table i want to fetch all the rows which have exactly all the 70 matching columns.i.e. if two rows contain same data in all the column then i need to find that row and count as '2'. I'm writing below query.

    SELECT (all 70 columns),COUNT(*) AS CountOf FROM tablename GROUP BY (all 70 columns) 
    HAVING  COUNT(*)>1;

但显示

    Error: Error while compiling statement: FAILED: SemanticException [Error 10411]: 
    Grouping sets size cannot be 
    greater than 64 (state=42000,code=10411)

有什么方法可以从配置单元表中找到确切重复行的计数吗?

is there any way to find the exact duplicate rows's count from hive table?

推荐答案

这是一个错误Hive 3.1.0 版本中的HIVE-21135 ,已在Hive 4.0.0 中修复,请参见 HIVE-21018 ,未反向移植.

It's a bug HIVE-21135 in Hive 3.1.0 version, it is fixed in the Hive 4.0.0, see HIVE-21018, not backported.

尝试在聚合之前使用子查询中的定界符将列连接起来作为一种解决方法,我不确定这是否会有所帮助.

Try to concatenate columns using delimiter in the subquery before aggregation as a workaround, I'm not sure will it help or not.

像这样,使用concat()或concat_ws或 || 运算符:

like this, using concat() or concat_ws or || operator:

select concat_ws ('~', col1, col2, col3, col4)
...
group by concat_ws ('~', col1, col2, col3, col4)

col1||'~'||col2||'~'||...||colN

NULL也应注意.使用NVL函数进行连接之前,请将空字符串替换为空.

NULLs should be taken care also. Replace nulls with empty strings before concatenation using NVL function.

这篇关于分组集大小不能大于64:Hive的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!

11-03 02:13