This article covers validating a large (~400MB) XML file for well-formedness in PHP; the answer below may be a useful reference for anyone facing the same problem.

Problem description

I have a large XML file (around 400MB) that I need to ensure is well-formed before I start processing it.

The first thing I tried was something similar to the code below, which is great as I can find out whether the XML is not well-formed and which parts of it are 'bad':

// Collect libxml errors internally instead of emitting PHP warnings,
// otherwise libxml_get_errors() returns nothing.
libxml_use_internal_errors(true);

$doc = simplexml_load_string($xmlstr);
if ($doc === false) {
    $errors = libxml_get_errors();

    foreach ($errors as $error) {
        echo display_xml_error($error); // user-defined formatting helper
    }

    libxml_clear_errors();
}

I also tried...

$doc->load($tempFileName, LIBXML_DTDLOAD | LIBXML_DTDVALID); // $doc being a DOMDocument

I tested this with a file of about 60MB, but anything much larger (~400MB) causes something new to me, the "OOM killer", to kick in and terminate the script after what always seems like 30 seconds.

I thought I might need to increase the memory available to the script, so I worked out the peak usage when processing the 60MB file, adjusted the limit accordingly for the larger one, and also turned the script time limit off just in case that was the cause:

set_time_limit(0);
ini_set('memory_limit', '512M');
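As an illustrative sketch (not part of the original post), `memory_get_peak_usage()` is one way to measure how much a parse actually consumed, which is how the peak for the smaller test file could be determined:

```php
<?php
// Illustrative: measure peak memory after a SimpleXML parse, to work
// out what memory_limit a bigger file would need. The sample document
// here stands in for the real 60MB test file.
$xml = '<root>' . str_repeat('<item>data</item>', 10000) . '</root>';
$doc = simplexml_load_string($xml);

printf("Peak memory: %.1f MB\n", memory_get_peak_usage(true) / 1048576);
```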

Unfortunately this didn't work, as the OOM killer appears to be a Linux mechanism that kicks in if memory load (is that even the right term?) is consistently high.

It would be great if I could load the XML in chunks somehow, as I imagine this would reduce the memory load so that the OOM killer doesn't stick its fat nose in and kill my process.

Does anyone have any experience validating a large XML file and capturing errors about where it is badly formed? A lot of posts I've read point to SAX and XMLReader as possible solutions to my problem.
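A streaming check along those lines can be sketched with XMLReader, which pulls one node at a time so memory stays flat regardless of file size. This is a minimal sketch, not from the original post; the function name is illustrative:

```php
<?php
// Sketch: streaming well-formedness check with XMLReader.
// Returns [bool $wellFormed, array $libxmlErrors].
function xml_is_well_formed(string $file): array
{
    // Collect libxml errors ourselves instead of letting them print.
    libxml_use_internal_errors(true);
    libxml_clear_errors();

    $reader = new XMLReader();
    if (!$reader->open($file)) {
        return [false, libxml_get_errors()];
    }

    // read() advances one node at a time; we only care whether it
    // gets through the whole document without recording errors.
    while ($reader->read()) {
        // no-op
    }

    $errors = $reader->close() ? libxml_get_errors() : libxml_get_errors();
    return [count($errors) === 0, $errors];
}
```

For a well-formed file this returns `[true, []]`; for a broken one, the error objects carry the line and column of the first failure.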

UPDATE: @chiborg pretty much solved this issue for me. The only downside to this method is that I don't get to see all of the errors in the file, just the first one that failed, which I guess makes sense, as the parser can't continue past the first point of failure.

When using SimpleXML, it was able to capture most of the issues in the file and show them to me at the end, which was nice.

Recommended answer

Since the SimpleXML and DOM APIs always load the whole document into memory, using a streaming parser like SAX or XMLReader is the better approach.

Adapting the code from the example page in the PHP manual, something like this:

$errors = array();

$xml_parser = xml_parser_create();
if (!($fp = fopen($file, "r"))) {
    die("could not open XML input");
}

// Feed the file to the expat parser in 4KB chunks, so the whole
// document never has to be held in memory at once.
while ($data = fread($fp, 4096)) {
    if (!xml_parse($xml_parser, $data, feof($fp))) {
        $errors[] = array(
                    xml_error_string(xml_get_error_code($xml_parser)),
                    xml_get_current_line_number($xml_parser));
        break; // expat cannot recover after a fatal error
    }
}
fclose($fp);
xml_parser_free($xml_parser);
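A quick way to exercise that loop is to point it at a small, deliberately malformed sample (the temp file and its contents below are illustrative only):

```php
<?php
// Illustrative: run the same expat-based chunked check against a tiny
// malformed document and confirm the error is captured.
$file = tempnam(sys_get_temp_dir(), 'xml');
file_put_contents($file, '<root><a></root>'); // mismatched tag

$errors = array();
$xml_parser = xml_parser_create();
$fp = fopen($file, 'r');

while ($data = fread($fp, 4096)) {
    if (!xml_parse($xml_parser, $data, feof($fp))) {
        $errors[] = array(
            xml_error_string(xml_get_error_code($xml_parser)),
            xml_get_current_line_number($xml_parser));
        break; // expat cannot recover after a fatal error
    }
}
fclose($fp);
xml_parser_free($xml_parser);
unlink($file);

echo $errors ? "not well-formed\n" : "well-formed\n";
```

Each `$errors` entry holds the expat error message and the line number where parsing stopped.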
