Parsing and traversing elements in a Markdown file

Question

I want to parse and then traverse a Markdown file. I'm looking for something like xml.etree.ElementTree but for Markdown.

One option would be to convert to HTML and then use another library to parse the HTML. But I'd like to avoid that step.

Thanks.

Recommended answer

As another comment mentioned, Python-Markdown has an extension API, and it happens to use xml.etree.ElementTree under the hood. You could theoretically create an extension that accesses that internal ElementTree object and does what you want with it. However, if you use raw HTML (including HTML entities) and/or the codehilite extension, you will get an incomplete document, as a few postprocessors run on the serialized string. So I wouldn't really recommend it for your intended purpose (full disclosure: I'm the developer of Python-Markdown).
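For illustration only, here is a minimal sketch of what such an extension might look like, assuming Python-Markdown 3.x is installed. The names TreeCapture, TreeCaptureExtension and markdown_to_tree are hypothetical, and the priority of 15 is an assumption chosen so the processor runs after the built-in inline processing. The caveat above still applies: raw HTML shows up only as placeholders in this tree.

```python
import markdown
from markdown.extensions import Extension
from markdown.treeprocessors import Treeprocessor


class TreeCapture(Treeprocessor):
    """Hypothetical treeprocessor that keeps a reference to the internal tree."""

    def __init__(self, md, sink):
        super().__init__(md)
        self.sink = sink

    def run(self, root):
        # root is an xml.etree.ElementTree.Element (a wrapping <div>).
        self.sink.append(root)
        return None  # returning None leaves the tree unchanged


class TreeCaptureExtension(Extension):
    """Hypothetical extension that registers TreeCapture."""

    def __init__(self, **kwargs):
        self.roots = []
        super().__init__(**kwargs)

    def extendMarkdown(self, md):
        # Assumed priority: lower than the built-in inline processor,
        # so inline elements such as <em> already exist in the tree.
        md.treeprocessors.register(TreeCapture(md, self.roots), "tree_capture", 15)


def markdown_to_tree(text):
    """Convert text and return the captured ElementTree root."""
    ext = TreeCaptureExtension()
    markdown.markdown(text, extensions=[ext])
    return ext.roots[-1]


if __name__ == "__main__":
    root = markdown_to_tree("# Title\n\nSome *text*.")
    for element in root.iter():
        print(element.tag, repr(element.text))
```

Once you have the root, everything in xml.etree.ElementTree (iter, find, findall) works on it as usual.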

A rather lengthy list of Markdown implementations exists here. Of the pure Python implementations in that list, Mistune is the only one that I am aware of that uses a two-step process (step one returns a parse tree, step two serializes the parse tree; you only need step one). I have never used Mistune personally and cannot speak to its stability or accuracy, but it is supposed to be a Python clone of the very good JavaScript library Marked.

If you search around, I believe that a few of the C implementations use a similar pattern. Some of them might even already have a Python wrapper. If not, it shouldn't be too difficult to create a wrapper with ctypes.
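As a rough sketch of the ctypes route, the following assumes the cmark C library (the CommonMark reference implementation, which the answer above does not name) is installed as a shared library. The helper names load_cmark, node_types and outline are hypothetical. If the library cannot be found, outline simply returns None.

```python
import ctypes
import ctypes.util


def load_cmark():
    """Load libcmark and declare the few functions used here, or return None."""
    path = ctypes.util.find_library("cmark")
    if path is None:
        return None
    lib = ctypes.CDLL(path)
    lib.cmark_parse_document.argtypes = [ctypes.c_char_p, ctypes.c_size_t, ctypes.c_int]
    lib.cmark_parse_document.restype = ctypes.c_void_p
    for name in ("cmark_node_first_child", "cmark_node_next"):
        fn = getattr(lib, name)
        fn.argtypes = [ctypes.c_void_p]
        fn.restype = ctypes.c_void_p
    lib.cmark_node_get_type_string.argtypes = [ctypes.c_void_p]
    lib.cmark_node_get_type_string.restype = ctypes.c_char_p
    lib.cmark_node_free.argtypes = [ctypes.c_void_p]
    lib.cmark_node_free.restype = None
    return lib


def node_types(lib, node, depth=0):
    """Yield (depth, node type) for a node and all of its descendants."""
    yield depth, lib.cmark_node_get_type_string(node).decode()
    child = lib.cmark_node_first_child(node)
    while child:  # a NULL pointer comes back as None
        yield from node_types(lib, child, depth + 1)
        child = lib.cmark_node_next(child)


def outline(text):
    """Parse Markdown with cmark and return a list of (depth, node type) pairs."""
    lib = load_cmark()
    if lib is None:
        return None
    data = text.encode("utf-8")
    root = lib.cmark_parse_document(data, len(data), 0)  # 0 is CMARK_OPT_DEFAULT
    try:
        return list(node_types(lib, root))
    finally:
        lib.cmark_node_free(root)  # frees the whole tree


if __name__ == "__main__":
    print(outline("# Title\n\nSome *text*."))
```

Only the node type is read here; a fuller wrapper would also declare accessors for literal text, heading levels and so on.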

If for some reason you want to use an implementation that does not give you a full parse tree, then I would suggest parsing the resulting HTML using LXML (a Python wrapper of the C library) or html5lib (pure Python), both of which can return an ElementTree object and are much faster (especially LXML) and more forgiving of invalid HTML (especially html5lib, which acts more like real browsers in the real world). Remember that Markdown can contain raw HTML, and most Markdown parsers simply pass it through, valid or not. If you then try to parse it with an XML-based parser (like the one in xml.etree) or a strict HTML parser (like html.parser in the standard library), a single invalid tag can crash the parser.
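To make the difference concrete, here is a small sketch. The function names strict_parse and lenient_parse are hypothetical, and lenient_parse assumes the third-party html5lib package is installed. An unclosed br tag, which is perfectly normal HTML, is enough for the XML parser to reject the whole document.

```python
import xml.etree.ElementTree as ET


def strict_parse(html):
    """Parse with the XML parser; return the root, or None if it is rejected."""
    try:
        return ET.fromstring(html)
    except ET.ParseError:
        return None


def lenient_parse(html):
    """Parse with html5lib (assumed installed) into an ElementTree element."""
    import html5lib  # imported lazily so strict_parse works without it

    return html5lib.parse(html, treebuilder="etree", namespaceHTMLElements=False)


if __name__ == "__main__":
    rendered = "<div><p>one<br>two</p></div>"  # valid HTML, not well-formed XML
    print(strict_parse(rendered))  # None: the XML parser rejects the unclosed <br>
    try:
        root = lenient_parse(rendered)
        print([el.tag for el in root.iter()])
    except ImportError:
        print("html5lib is not installed")
```

Note that html5lib builds a full document, so the elements from the fragment end up inside html and body elements.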
