This article looks at how to handle XmlInputFormat elements that are larger than the HDFS block size; it should be a useful reference for anyone facing the same problem.

Problem Description

I'm new to Hadoop MapReduce (4 days to be precise) and I've been asked to perform distributed XML parsing on a cluster. As per my (re)search on the Internet, it should be fairly easy using Mahout's XmlInputFormat, but my task is to make sure that the system works for huge (~5TB) XML files.

As per my knowledge, the file splits sent to the mappers cannot be larger than the HDFS block size (or the per-job block size). [Correct me if I'm mistaken].

The issue I'm facing is that some XML elements are large (~200MB) and some are small (~1MB).

So my question is: What happens when the XML element chunk created by XmlInputFormat is bigger than the block size? Will it send the entire large element (say 200MB) to a mapper, or will it send it out in multiple splits (64 + 64 + 64 + 8)?

I currently don't have access to the company's Hadoop cluster (and won't for some time), so I cannot perform a test to find out. Kindly help me out.

Solution

So to clear some things up:

Mahout's XMLInputFormat will process XML files and extract out the XML between two configured start / end tags. So if your XML looks like the following:

<main>
  <person>
    <name>Bob</name>
    <dob>1970/01/01</dob>
  </person>
</main>

and you've configured the start / end tags to be <person> and </person>, then your mapper will be passed the following <LongWritable, Text> pair to its map method:

LongWritable: 10
Text: "<person>\n    <name>Bob</name>\n    <dob>1970/01/01</dob>\n  </person>"

What you do with this data in your mapper is then up to you.
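
For reference, here is a minimal driver sketch along these lines (not part of the original answer). It assumes the commonly circulated Mahout XmlInputFormat, which reads its tag pair from the xmlinput.start / xmlinput.end configuration keys, and a hypothetical PersonMapper like the SAX sketch shown after the EDIT list further down; exact package and API names vary with the Mahout and Hadoop versions in use.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
// In newer Mahout releases this class lives under org.apache.mahout.text.wikipedia;
// in older ones it was org.apache.mahout.classifier.bayes.XmlInputFormat.
import org.apache.mahout.text.wikipedia.XmlInputFormat;

public class PersonXmlDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // The record reader hands the mapper everything between these two
        // byte sequences, inclusive, as a single Text value.
        conf.set("xmlinput.start", "<person>");
        conf.set("xmlinput.end", "</person>");

        Job job = Job.getInstance(conf, "person xml parse"); // older Hadoop: new Job(conf, ...)
        job.setJarByClass(PersonXmlDriver.class);
        job.setInputFormatClass(XmlInputFormat.class);
        job.setMapperClass(PersonMapper.class);   // hypothetical; sketched further down
        job.setNumReduceTasks(0);                 // map-only: parse each record and write it out
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}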

With regards to splits, XmlInputFormat extends TextInputFormat, so if your input file is splittable (i.e. uncompressed, or compressed with a splittable codec; note that a standalone Snappy file is not splittable on its own — Snappy only splits when used inside a container format such as a SequenceFile), then the file will be processed by one or more mappers as follows:

  1. If the input file size (let's say 48 MB) is less than a single block in HDFS (let's say 64 MB), and you don't configure min / max split size properties, then you'll get a single mapper to process the file
  2. As above, but if you configure the max split size to be 10 MB (mapred.max.split.size=10485760), then you'll get 5 map tasks to process the file (see the configuration sketch after this list)
  3. If the file is bigger than the block size, then you'll get a map task for each block, or, if a max split size is configured, a map task for each chunk of the file carved out by that split size
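
For item 2, the split cap could be set along these lines (a hedged sketch; property names differ between Hadoop versions — mapred.max.split.size is the old-API key, newer releases spell it mapreduce.input.fileinputformat.split.maxsize):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeConfig {
    public static Job tenMegabyteSplits() throws Exception {
        Configuration conf = new Configuration();
        // Old-API property name, as quoted in the answer above.
        conf.setLong("mapred.max.split.size", 10L * 1024 * 1024);

        Job job = Job.getInstance(conf, "xml parse with 10 MB splits");
        // The new mapreduce API exposes the same setting as a helper, which
        // writes mapreduce.input.fileinputformat.split.maxsize on current
        // Hadoop releases.
        FileInputFormat.setMaxInputSplitSize(job, 10L * 1024 * 1024);
        return job;
    }
}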

When the file is split up into these block-sized or split-sized chunks, the XmlInputFormat will seek to the byte offset at which its block / split starts and then scan forwards until it either finds the configured XML start tag or reaches the end of the block / split. If it finds the start tag, it will then consume data until it finds the end tag (or end of file). If it finds the end tag, a record will be passed to your mapper; otherwise your mapper will not receive any input. To emphasize, the map may scan past the end of the block / split when trying to find the end tag, but will only do this if it has found a start tag; otherwise scanning stops at the end of the block / split.
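
As a rough illustration of that scanning behaviour, here is a simplified sketch of the record reader logic — not the exact Mahout source; the class, field, and method names are illustrative, and the real reader also handles buffering details and counters:

import java.io.IOException;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.io.DataOutputBuffer;

// Simplified sketch of the record reader's boundary handling. `fsin` is
// assumed to be positioned at the start of this split and `end` is the byte
// offset at which the split finishes.
class XmlBoundaryScanSketch {
    private FSDataInputStream fsin;
    private long end;
    private final DataOutputBuffer buffer = new DataOutputBuffer();

    // Find a start tag inside this split, then read through to the matching
    // end tag, even if that carries us past the split boundary.
    boolean nextRecord(byte[] startTag, byte[] endTag) throws IOException {
        if (fsin.getPos() < end && scanFor(startTag, true, false)) {
            buffer.reset();
            buffer.write(startTag, 0, startTag.length);
            return scanFor(endTag, false, true);   // record bytes now sit in `buffer`
        }
        return false;   // no start tag before the split boundary: no record
    }

    // Scan forward byte by byte until `tag` has been matched in full.
    private boolean scanFor(byte[] tag, boolean stopAtSplitEnd, boolean keepBytes)
            throws IOException {
        int matched = 0;
        while (true) {
            int b = fsin.read();
            if (b == -1) {
                return false;                      // end of file
            }
            if (keepBytes) {
                buffer.write(b);                   // collecting the record value
            }
            if (b == tag[matched]) {
                matched++;
                if (matched >= tag.length) {
                    return true;                   // whole tag seen
                }
            } else {
                matched = 0;
            }
            // Only the search for the start tag respects the split boundary;
            // the end-tag scan is allowed to run past it once a record began.
            if (stopAtSplitEnd && matched == 0 && fsin.getPos() >= end) {
                return false;
            }
        }
    }
}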

So to (eventually) answer your question, if you haven't configured a mapper (and are using the default, or identity mapper as it's also known), then yes, it doesn't matter how big the XML chunk is (MBs, GBs, TBs!), it will be sent to the reducer.

I hope this makes sense.

EDIT

To follow up on your comments:

  1. Yes, each mapper will attempt to process its split (range of bytes) of the file
  2. Yes, regardless of what you set the max split size to, your mapper will receive records which represent the data between (and including) the start / end tags. The person element will not be split up no matter what its size is (obviously if there are GBs of data between the start and end element, you'll most probably run out of memory trying to buffer it into a Text object)
  3. Continuing from the above, your data will never be split up between the start and end element; a person element will be sent in its entirety to a single mapper, so you should be fine using something like a SAX parser to process it further, without fear that you're only seeing a portion of the person element (a SAX-based mapper sketch follows this list)
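
To illustrate point 3, here is a minimal mapper sketch that feeds each complete <person> record into a SAX parser and emits the name. PersonMapper, NameHandler, and the output key choice are illustrative additions, not part of the original answer; this is the mapper assumed by the driver sketch shown earlier.

import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Illustrative mapper: each input value is one complete <person>...</person>
// chunk, so it can be handed to a SAX parser as a standalone document.
public class PersonMapper extends Mapper<LongWritable, Text, Text, Text> {

    @Override
    protected void map(LongWritable offset, Text personXml, Context context)
            throws IOException, InterruptedException {
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            NameHandler handler = new NameHandler();
            parser.parse(new ByteArrayInputStream(
                    personXml.toString().getBytes(StandardCharsets.UTF_8)), handler);
            if (handler.name != null) {
                // Emit the person's name keyed by the byte offset of the record.
                context.write(new Text(Long.toString(offset.get())), new Text(handler.name));
            }
        } catch (Exception e) {
            // A malformed record shouldn't kill the whole task in this sketch.
            context.getCounter("PersonMapper", "MALFORMED_RECORDS").increment(1);
        }
    }

    /** Collects the text content of the first <name> element. */
    private static class NameHandler extends DefaultHandler {
        String name;
        private boolean inName;
        private final StringBuilder text = new StringBuilder();

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if ("name".equals(qName)) {
                inName = true;
                text.setLength(0);
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (inName) {
                text.append(ch, start, length);
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if ("name".equals(qName) && name == null) {
                name = text.toString().trim();
                inName = false;
            }
        }
    }
}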

That wraps up this article on parsing XmlInputFormat elements larger than the HDFS block size. We hope the answer above helps, and thank you for your continued support!
