Is there a way to merge ORC files in HDFS without using the ALTER TABLE CONCATENATE command?

Question

This is my first week with Hive and HDFS, so please bear with me.

Almost every approach I have seen so far for merging multiple ORC files suggests using ALTER TABLE with the CONCATENATE command.

But I need to merge multiple ORC files of the same table without having to ALTER the table. Another option is to create a copy of the existing table and then use ALTER TABLE on the copy, so that my original table remains unchanged. But I can't do that either, because of space and data redundancy concerns.

What I'm trying to achieve (ideally) is this: I need to transport these ORCs into a cloud environment as one file per table. So, is there a way to merge the ORCs on the fly during the transfer to the cloud? Can this be achieved with or without Hive, maybe directly in HDFS?

Recommended answer

Two possible methods other than ALTER TABLE CONCATENATE:

  1. Try configuring the merge task; see the details here: https://stackoverflow.com/a/45266244/2700344
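
As a rough sketch of what configuring the merge task can look like: the properties below are standard Hive merge settings, but the size values are only illustrative, my_table is a placeholder name, and the linked answer may recommend a different combination. With these set, Hive may add a merge step at the end of a job when the average output file size falls below the threshold; whether that ends up as exactly one file depends on the data size relative to hive.merge.size.per.task.

    -- Illustrative values only; tune them to your data volume.
    SET hive.merge.mapfiles=true;                 -- merge outputs of map-only jobs
    SET hive.merge.mapredfiles=true;              -- merge outputs of map-reduce jobs
    SET hive.merge.tezfiles=true;                 -- merge outputs when running on Tez
    SET hive.merge.smallfiles.avgsize=128000000;  -- trigger a merge below this average file size (bytes)
    SET hive.merge.size.per.task=256000000;       -- target size of merged files (bytes)

    -- my_table is a placeholder; rewriting the table lets the merge step run.
    INSERT OVERWRITE TABLE my_table
    SELECT * FROM my_table;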

Alternatively, you can force a single reducer. This method works well for files that are not too big. You can overwrite the same table with ORDER BY, which forces a single reducer in the final ORDER BY stage. With big files this will be slow or may even fail, because all the data passes through a single reducer:

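    -- my_table and some_col are placeholders for your table and sort column
    INSERT OVERWRITE TABLE my_table
    SELECT * FROM my_table
      ORDER BY some_col; -- the final ORDER BY stage runs on a single reducer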

As a side effect, you will get a better-packed ORC file with an efficient index on the columns listed in the ORDER BY.
