Question
I have a Hive insert overwrite query:
set mapred.map.tasks=1;
set mapred.reduce.tasks=1;
insert overwrite table staging.table1 partition(dt) select * from testing.table1;
When I inspect the HDFS directory for staging.table1, I see that two part files were created:
2019-12-25 02:25 /data/staging/table1/dt=2019-12-24/000000_0
2019-12-25 02:25 /data/staging/table1/dt=2019-12-24/000001_0
Why are 2 files created?
I am using the Beeline client and Hive 2.1.1-cdh6.3.1.
Answer
The insert query you executed is map-only, which means there is no reduce task, so setting mapred.reduce.tasks has no effect.
Also, the number of mappers is determined by the number of input splits, so setting mapred.map.tasks won't change the parallelism of mappers.
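As a quick way to see this for yourself, you can inspect the query plan with EXPLAIN; a map-only job shows no reduce stage in its Stage Plans (the exact plan text varies by Hive version, so treat this as a sketch):

```sql
-- A map-only insert: the plan contains a Map Operator Tree
-- but no Reduce Operator Tree stage.
EXPLAIN
INSERT OVERWRITE TABLE staging.table1 PARTITION (dt)
SELECT * FROM testing.table1;
```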
There are at least two feasible ways to force the number of resulting files to be 1:
- Enforce a post job for file merging.
  Set hive.merge.mapfiles to true. Well, the default value is already true.
  Decrease hive.merge.smallfiles.avgsize to actually trigger the merging.
  Increase hive.merge.size.per.task to be big enough as the target size after merging.
- Configure the file-merging behavior of mappers to cut down the number of mappers.
  Make sure that hive.input.format is set to org.apache.hadoop.hive.ql.io.CombineHiveInputFormat, which is also the default.
  Then increase mapreduce.input.fileinputformat.split.maxsize to allow a larger split size.
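Put together as session settings, a minimal sketch of the two options might look like the following. The threshold values are illustrative assumptions, not recommendations; tune them to your data sizes:

```sql
-- Option 1: merge small files in a post job after the map-only insert.
set hive.merge.mapfiles=true;                 -- already the default
set hive.merge.smallfiles.avgsize=134217728;  -- 128 MB (assumed): avg output size below this triggers the merge
set hive.merge.size.per.task=268435456;       -- 256 MB (assumed): target file size after merging

-- Option 2: combine input splits so fewer mappers run, producing fewer files.
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;  -- already the default
set mapreduce.input.fileinputformat.split.maxsize=1073741824;               -- 1 GB (assumed) max split size

insert overwrite table staging.table1 partition(dt)
select * from testing.table1;
```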