Parsing millions of small XML files

Question

I have 10 million small XML files (300 KB to 500 KB each). I'm using Mahout's XML input format in MapReduce to read the data and a SAX parser to parse it, but processing is very slow. Will compressing the input files (LZO) help improve performance? Each folder contains 80-90k XML files, and when I start the job it runs one mapper per file. Is there any way to reduce the number of mappers?

Recommended answer

Hadoop doesn't work well with a huge number of small files. It was designed to process a small number of very large files.

Compressing your files won't help. As you have noticed, the problem is that your job has to instantiate a large number of containers to run the map tasks (one per file). Instantiating a container can take longer than processing its input, and it also consumes a lot of resources such as memory and CPU.
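To make the overhead argument concrete, here is a rough back-of-the-envelope sketch. The file count comes from the question; the container startup time, per-file parse time, and files-per-task figures are purely hypothetical placeholders, so substitute values measured on your own cluster.

```java
public class MapperOverheadEstimate {

    // Total task-seconds when every file gets its own container.
    static double perFileSeconds(long files, double startupSec, double parseSec) {
        return files * (startupSec + parseSec);
    }

    // Total task-seconds when each container processes up to filesPerTask files.
    static double combinedSeconds(long files, long filesPerTask,
                                  double startupSec, double parseSec) {
        long tasks = (files + filesPerTask - 1) / filesPerTask; // ceiling division
        return tasks * startupSec + files * parseSec;
    }

    public static void main(String[] args) {
        long files = 10_000_000L;  // from the question
        double startupSec = 2.0;   // hypothetical container startup cost
        double parseSec = 0.05;    // hypothetical SAX parse time per file
        long filesPerTask = 300;   // hypothetical number of files per combined split

        System.out.printf("one container per file: %.0f task-seconds%n",
                perFileSeconds(files, startupSec, parseSec));
        System.out.printf("combined splits:        %.0f task-seconds%n",
                combinedSeconds(files, filesPerTask, startupSec, parseSec));
    }
}
```

With any numbers of this shape, where startup costs far more than parsing one small file, the startup term dominates the per-file case. That is why compressing the input does not address the bottleneck.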

I'm not familiar with Mahout's input formats, but Hadoop has a class that mitigates this problem by combining several inputs into one mapper: CombineTextInputFormat. To work with XML, you may need to create your own XMLInputFormat that extends CombineFileInputFormat.
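The idea behind combining inputs can be illustrated without Hadoop on the classpath. The sketch below is not Hadoop's implementation (the real CombineFileInputFormat also takes node and rack locality into account). It only shows the general technique of greedily packing many small files into splits capped at a maximum size. The class name, the 128 MB cap, and the synthetic file sizes are assumptions made for illustration. In a real job the cap is typically controlled through the `mapreduce.input.fileinputformat.split.maxsize` property.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitPackingSketch {

    // Greedily groups file sizes into splits of at most maxSplitBytes each.
    // A single file larger than the cap still gets a split of its own.
    static List<List<Long>> pack(long[] fileSizes, long maxSplitBytes) {
        List<List<Long>> splits = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long currentBytes = 0;
        for (long size : fileSizes) {
            // Close the current split if adding this file would exceed the cap.
            if (!current.isEmpty() && currentBytes + size > maxSplitBytes) {
                splits.add(current);
                current = new ArrayList<>();
                currentBytes = 0;
            }
            current.add(size);
            currentBytes += size;
        }
        if (!current.isEmpty()) {
            splits.add(current);
        }
        return splits;
    }

    public static void main(String[] args) {
        long kb = 1024L;
        long[] sizes = new long[90_000]; // one folder, upper bound from the question
        for (int i = 0; i < sizes.length; i++) {
            sizes[i] = (300 + (i % 201)) * kb; // synthetic sizes between 300 KB and 500 KB
        }
        long maxSplit = 128L * 1024 * kb; // hypothetical 128 MB cap per split
        List<List<Long>> splits = pack(sizes, maxSplit);
        System.out.println(sizes.length + " files -> " + splits.size() + " splits");
    }
}
```

Each resulting split corresponds to one mapper, so the mapper count follows the total data volume rather than the number of files.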

Another alternative, though with a smaller improvement, is to reuse the JVM among the containers: see "reuse JVM in Hadoop mapreduce jobs".

Reusing the JVM saves the time required to create each JVM, but you still need to create one container for each file.
