
Problem description


I have a Hive insert overwrite query:

set mapred.map.tasks=1;
set mapred.reduce.tasks=1;
insert overwrite table staging.table1 partition(dt) select * from testing.table1;


When I inspect the HDFS directory for staging.table1, I see that there are 2 part files created.

2019-12-25 02:25 /data/staging/table1/dt=2019-12-24/000000_0
2019-12-25 02:25 /data/staging/table1/dt=2019-12-24/000001_0


Why are 2 part files created?


I am using the beeline client and Hive 2.1.1-cdh6.3.1.

Recommended answer


The insert query you executed is map-only, which means there is no reduce task, so setting mapred.reduce.tasks has no effect.


Also, the number of mappers is determined by the number of input splits, so setting mapred.map.tasks will not change the parallelism of the mappers. In this case there were two splits, hence two mappers, and each mapper wrote its own part file (000000_0 and 000001_0).


There are at least two feasible ways to force the resulting number of files down to 1:

  1. Enforce a merge job after the insert.
    Set hive.merge.mapfiles to true (the default value is already true).
    Decrease hive.merge.smallfiles.avgsize so the average output file size actually falls below the threshold and triggers the merge.
    Increase hive.merge.size.per.task so it is large enough to serve as the target file size after merging.
  2. Configure the split-combining behavior of mappers to cut down the number of mappers.
    Make sure hive.input.format is set to org.apache.hadoop.hive.ql.io.CombineHiveInputFormat, which is also the default.
    Then increase mapreduce.input.fileinputformat.split.maxsize to allow larger splits, so fewer mappers (and thus fewer part files) are created.
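The two approaches above can be sketched as session settings in beeline. This is a minimal sketch; the byte values are illustrative assumptions, not tuned recommendations, and you would normally pick one approach rather than both:

```sql
-- Approach 1: let Hive launch a merge job after the map-only insert.
set hive.merge.mapfiles=true;                  -- default is already true
set hive.merge.smallfiles.avgsize=128000000;   -- assumed threshold (~128 MB): merge if avg output file is smaller
set hive.merge.size.per.task=256000000;        -- assumed target size (~256 MB) per merged file

-- Approach 2: combine input splits so a single mapper writes a single part file.
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;  -- the default
set mapreduce.input.fileinputformat.split.maxsize=1073741824;               -- assumed 1 GB max split

insert overwrite table staging.table1 partition(dt)
select * from testing.table1;
```

For approach 2 to yield exactly one file, the combined input must fit within the assumed split.maxsize; otherwise Hive will still create one mapper per split.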

