Hive window function ROW_NUMBER without a PARTITION BY clause is very slow on a large (50 GB) dataset. Is there a better way to optimize it?

Problem description

I have an HDFS file with 50 million records, and the raw file size is 50 GB.

I am trying to load it into a Hive table and, while loading, create a unique ID for every row using the expression below. I am using Hive 1.1.0-cdh5.16.1.

row_number() over(order by event_id, user_id, timestamp) as id
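
For context, the full load statement looks roughly like the sketch below (the events and events_stg table names are hypothetical stand-ins for the real schema):

insert overwrite table events
select
    -- the global ORDER BY forces every row through a single reducer
    row_number() over (order by event_id, user_id, timestamp) as id,
    e.*
from events_stg e;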

While executing, I see that 40 reducers are assigned in the reduce step. The average time for 39 of the reducers is about 2 minutes, whereas the last reducer takes about 25 minutes, which clearly makes me believe that most of the data is processed in one reducer.

I suspected the ORDER BY clause to be the reason for this behavior and tried the following:

row_number() over() as id

Yet, I see the same behavior.

Thinking about the MapReduce paradigm, it seems to me that if we do not specify a PARTITION BY clause, the data has to be processed in one reducer (undistributed) in order to see all rows and attach the correct row numbers. This could be true for any window function with no PARTITION BY clause, or with PARTITION BY on a skewed column.

Now, my question is: how do we circumvent this problem and optimize window functions when we have to avoid the PARTITION BY clause?

Recommended answer

You can use a UUID:

select java_method('java.util.UUID','randomUUID')

A UUID generated in your system/workflow will also be unique in any other system, because UUIDs are globally unique. UUID generation works fully distributed and is fast.
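
Applied to the load above, a minimal sketch (the table names are again hypothetical; java_method invokes java.util.UUID.randomUUID() via reflection, and the regexp_replace merely strips the dashes):

insert overwrite table events
select
    -- each mapper generates its own IDs independently; with no ORDER BY,
    -- nothing funnels the data through a single reducer
    regexp_replace(java_method('java.util.UUID','randomUUID'), '-', '') as id,
    e.*
from events_stg e;

Because nothing forces a global ordering, the query can run fully distributed and the 25-minute straggler reducer disappears; the trade-off is that the IDs are 32-character strings rather than compact sequential integers.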

Also, in Hive 3.x there is the SURROGATE_KEY function, which you can use in the DDL.
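
A rough sketch of that approach (assuming Hive 3.x with ACID enabled, since SURROGATE_KEY is used as a column DEFAULT on a transactional table; the schema is again hypothetical):

create table events (
    id bigint default surrogate_key(),
    event_id string,
    user_id string,
    `timestamp` timestamp
)
stored as orc
tblproperties ('transactional'='true');

-- omit the id column so the default generates it
insert into events (event_id, user_id, `timestamp`)
select event_id, user_id, `timestamp` from events_stg;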
