This article covers how file splitting works in Hadoop / HDFS. It should be a useful reference for anyone dealing with the same question; read on to learn more.

Problem Description

I want to confirm the following. Please verify whether this is correct:
1. As I understand it, when we copy a file into HDFS, that is the point at which the file (assuming its size > 64 MB = the HDFS block size) is split into multiple chunks, and each chunk is stored on a different data node.


  1. The file contents are already split into chunks when the file is copied into HDFS, and no file splitting happens at the time the map job runs. Map tasks are only scheduled so that each one works on a chunk of at most 64 MB, with data locality (i.e. a map task runs on the node that holds its data/chunk).


  2. File splitting also happens if the file is compressed (gzipped), but MR ensures that each such file is processed by just one mapper, i.e. MR collects all the chunks of the gzip file lying on other data nodes and gives them all to a single mapper.


  3. The same thing as above happens if we define isSplitable() to return false, i.e. all the chunks of a file are processed by one mapper running on one machine. MR reads all the chunks of the file from different data nodes and makes them available to a single mapper.



Solution

Your understanding is not ideal.
I would point out that there are two almost independent processes: splitting a file into HDFS blocks, and splitting a file into the pieces processed by different mappers.

HDFS splits a file into blocks based on the configured block size.
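As a concrete look at that first process, here is a minimal sketch (assuming a reachable HDFS cluster; the file path is a hypothetical placeholder) that uses the standard FileSystem / BlockLocation API to list the blocks of a file and the data nodes holding each one. A file larger than the block size shows up as several blocks spread across nodes:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListBlocks {
    public static void main(String[] args) throws Exception {
        // Picks up core-site.xml / hdfs-site.xml from the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/demo/big-input.txt"); // hypothetical path
        FileStatus status = fs.getFileStatus(file);
        // One BlockLocation per HDFS block of the file.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation b : blocks) {
            System.out.printf("offset=%d length=%d hosts=%s%n",
                    b.getOffset(), b.getLength(), String.join(",", b.getHosts()));
        }
        fs.close();
    }
}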

Each input format has its own logic for how a file is split into parts that are processed independently by different mappers. The default logic of FileInputFormat is to split the file along HDFS block boundaries. You can implement any other logic, as in the sketch below.
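One way to implement such "other logic" is to override isSplitable(), which also covers point 3 of the question. Below is a minimal sketch (the class name WholeFileTextInputFormat is a hypothetical example) of a TextInputFormat subclass that never splits its input, so each file becomes exactly one split handled by a single mapper, no matter how many HDFS blocks it occupies. Note that the stock TextInputFormat already refuses to split gzipped files, since its isSplitable() returns false for codecs that are not splittable, which is why a gzip file always goes to one mapper (point 2 of the question).

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class WholeFileTextInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        // Never split: the whole file becomes one input split, and the
        // framework reads all of its blocks (possibly from remote data
        // nodes) and feeds them to a single mapper.
        return false;
    }
}

A job would then use it via job.setInputFormatClass(WholeFileTextInputFormat.class).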

Compression is usually an enemy of splitting, so block compression is used to make compressed data splittable. It means that each logical part (block) of the file is compressed independently; a sketch of producing such output follows.
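A common way to get block compression in practice is to have a job write block-compressed SequenceFiles. Below is a minimal sketch of that configuration (the output path is a hypothetical placeholder, and the mapper/reducer setup is assumed to happen elsewhere): groups of records are compressed together and separated by sync markers, so a later job can still split the output even though the underlying codec (gzip here) is not itself splittable.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class BlockCompressedOutput {
    public static void configure(Job job) {
        // Write SequenceFiles instead of plain text files.
        job.setOutputFormatClass(SequenceFileOutputFormat.class);
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
        // BLOCK compression: many records are compressed as one unit, and the
        // file stays splittable because sync markers separate the blocks.
        SequenceFileOutputFormat.setOutputCompressionType(job,
                SequenceFile.CompressionType.BLOCK);
        FileOutputFormat.setOutputPath(job, new Path("/user/demo/out")); // hypothetical path
    }
}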



That concludes this article on Hadoop / HDFS file splitting. We hope the answer above is helpful, and thank you for your continued support!
