


I can parse the document and generate output; however, the output cannot be parsed into an XElement because of a p tag. Everything else in the string is parsed correctly.


var input = "<p> Not sure why is is null for some wierd reason!<br><br>I have implemented the auto save feature, but does it really work after 100s?<br></p> <p> <i>Autosave?? </i> </p> <p>we are talking...</p><p></p><hr><p><br class=\"GENTICS_ephemera\"></p>";


My code:

public static XElement CleanupHtml(string input)
{
    HtmlAgilityPack.HtmlDocument htmlDoc = new HtmlAgilityPack.HtmlDocument();

    htmlDoc.OptionOutputAsXml = true;
    //htmlDoc.OptionWriteEmptyNodes = true;
    //htmlDoc.OptionAutoCloseOnEnd = true;
    htmlDoc.OptionFixNestedTags = true;

    htmlDoc.LoadHtml(input);

    // ParseErrors is an ArrayList containing any errors from the Load statement
    if (htmlDoc.ParseErrors != null && htmlDoc.ParseErrors.Count() > 0)
    {
    }
    else
    {
        if (htmlDoc.DocumentNode != null)
        {
            var ndoc = new HtmlDocument(); // HTML doc instance
            HtmlNode p = ndoc.CreateElement("body");

            p.InnerHtml = htmlDoc.DocumentNode.InnerHtml;
            var result = p.OuterHtml.Replace("<br>", "<br/>");
            result = result.Replace("<br class=\"special_class\">", "<br/>");
            result = result.Replace("<hr>", "<hr/>");
            return XElement.Parse(result, LoadOptions.PreserveWhitespace);
        }
    }
    return new XElement("body");
}



   <p> Not sure why is is null for some wierd reason chappy!
   <br/>I have implemented the auto save feature, but does it really work after 100s?
   <i>Autosave?? </i> 
   <p>we are talking...</p>


The bold p tag is the one that did not output correctly... Is there a way around this? Am I doing something wrong with the code?



What you are trying to do is basically transform HTML input into XML output.

Html Agility Pack can do that when you set the OptionOutputAsXml option. In that case, though, you should not use the InnerHtml property. Instead, let Html Agility Pack do the groundwork for you with one of HtmlDocument's Save methods.


Here is a generic function to convert HTML text to an XElement instance:

public static XElement HtmlToXElement(string html)
{
    if (html == null)
        throw new ArgumentNullException("html");

    HtmlDocument doc = new HtmlDocument();
    doc.OptionOutputAsXml = true;
    doc.LoadHtml(html);
    using (StringWriter writer = new StringWriter())
    {
        doc.Save(writer);
        using (StringReader reader = new StringReader(writer.ToString()))
        {
            return XElement.Load(reader);
        }
    }
}

As you can see, you don't have to do much work yourself! Note that since your original input text has no root element, Html Agility Pack will automatically add one enclosing SPAN to ensure the output is valid XML.
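
If you would rather end up with a body root, as the code in the question does, one option is to re-wrap the children of that enclosing element with LINQ to XML afterwards. This is only a minimal sketch: it assumes the parsed result has the single-wrapper shape described above, it uses a hard-coded string as a stand-in for real Html Agility Pack output, and RewrapAsBody is a hypothetical helper name, not part of any library.

```csharp
using System;
using System.Xml.Linq;

public static class SpanUnwrapSketch
{
    // Hypothetical helper: copies the child nodes of the wrapper element
    // into a new <body> element. Attributes on the wrapper are not copied.
    public static XElement RewrapAsBody(XElement wrapped)
    {
        if (wrapped == null)
            throw new ArgumentNullException("wrapped");

        // Nodes() returns the child nodes; the XElement constructor copies
        // nodes that already belong to another parent.
        return new XElement("body", wrapped.Nodes());
    }

    public static void Main()
    {
        // Assumed stand-in for the XElement produced from Html Agility Pack output.
        XElement wrapped = XElement.Parse("<span><p>we are talking...</p><hr /></span>");
        Console.WriteLine(RewrapAsBody(wrapped));
    }
}
```

Doing this on the XElement rather than on the serialized string keeps the result well-formed without any string replacement.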


In your case, you also want to process some tags, so here is how to do it with your example:

    public static XElement CleanupHtml(string input)
    {
        if (input == null)
            throw new ArgumentNullException("input");

        HtmlDocument doc = new HtmlDocument();
        doc.OptionOutputAsXml = true;
        doc.LoadHtml(input);

        // extra processing, remove some attributes using DOM
        HtmlNodeCollection coll = doc.DocumentNode.SelectNodes("//br[@class='special_class']");
        if (coll != null)
        {
            foreach (HtmlNode node in coll)
            {
                node.Attributes.Remove("class");
            }
        }

        using (StringWriter writer = new StringWriter())
        {
            doc.Save(writer);
            using (StringReader reader = new StringReader(writer.ToString()))
            {
                return XElement.Load(reader);
            }
        }
    }


As you can see, you should not use raw string functions, but instead use the Html Agility Pack DOM functions (SelectNodes, Add, Remove, etc.).

