HtmlAgilityPack - Does not format HTML correctly

In my web service I always need to format the HTML I receive. To make sure it ends up well formed, I use HtmlAgilityPack.

HTML I receive:

<p>
    <div>
        <b>text:</b> 
        <img alt="" height="362" src="/PublishingImages/imageName.png?RenditionID="16&Width=639&Height=362" width="639" style="BORDER: 0px solid; ">
    </div>
    <div>             <!-- left open like this is correct -->
        <b>text:</b> 
        <div style="text-align:justify;"></div>
        <div style="text-align:justify;"></div>
        <p style="text-align:justify;">
            <span class="ms-rteThemeForeColor-2-0">
                <br>
                text
            </span>
        </p>
        <p style="text-align:justify;">
            <br class="ms-rteThemeForeColor-2-0">
            <span class="ms-rteThemeForeColor-2-0">
                text
                <br>
            </span>
        </p>
        <p style="text-align:justify;">
            <span class="ms-rteThemeForeColor-2-0">
                <br>
            </span>
        </p>
        <p style="text-align:justify;">
            <span class="ms-rteThemeForeColor-2-0">
                text
            </span>
            <br>
        </p>
    </div>
</p>

My code to format the HTML:

if (!HtmlNode.ElementsFlags.ContainsKey("p"))
    HtmlNode.ElementsFlags.Add("p", HtmlElementFlag.Closed);
else
    HtmlNode.ElementsFlags["p"] = HtmlElementFlag.Closed;

if (!HtmlNode.ElementsFlags.ContainsKey("span"))
    HtmlNode.ElementsFlags.Add("span", HtmlElementFlag.Closed);
else
    HtmlNode.ElementsFlags["span"] = HtmlElementFlag.Closed;

if (!HtmlNode.ElementsFlags.ContainsKey("div"))
    HtmlNode.ElementsFlags.Add("div", HtmlElementFlag.Closed);
else
    HtmlNode.ElementsFlags["div"] = HtmlElementFlag.Closed;

var htmlDoc = new HtmlDocument();
htmlDoc.OptionFixNestedTags = true;
htmlDoc.OptionWriteEmptyNodes = true;
htmlDoc.LoadHtml(myHtml);

foreach (var eachNode in htmlDoc.DocumentNode.SelectNodes("//*"))
{
    var count = 0;
    foreach (var attr in eachNode.Attributes)
        if (attr.Name.ToLower() != "href" && attr.Name.ToLower() != "src" && attr.Name.ToLower() != "alt" && attr.Name.ToLower() != "style")
        {
            attr.Name = "feeds" + count.ToString();
            attr.Value = "";
            count++;
        }
}

var htmlError = htmlDoc.ParseErrors.SafeAny();

if (!htmlError)
    myHtml = htmlDoc.DocumentNode.InnerHtml;

However, the HTML that HtmlAgilityPack produces differs slightly from the original HTML.

HTML after being formatted by HtmlAgilityPack:

<p>
   <div>
      <b>text:</b>
      <img alt="" feeds0="" src="/PublishingImages/imageName.png?RenditionID=" feeds1="" feeds2="" style="BORDER: 0px solid; " />
   </div>
   <div />             <!-- should not be closed -->
   <b>text:</b>
   <div style="text-align:justify;" />
   <div style="text-align:justify;">          <!-- should not be open -->
      <p style="text-align:justify;">
         <span feeds0="">
            <br />
            text
         </span>
      </p>
      <p style="text-align:justify;">
         <br />
         <span feeds0="">
            text
            <br />
         </span>
      </p>
      <p style="text-align:justify;">
         <span feeds0="">
            <br />
         </span>
      </p>
      <p style="text-align:justify;">
         <span feeds0="">
            text
         </span>
         <br />
      </p>
   </div>
</p>

Why does this happen and how can I fix it? I have already found that if I comment out the following code, the HTML is formatted properly:

if (!HtmlNode.ElementsFlags.ContainsKey("div"))
    HtmlNode.ElementsFlags.Add("div", HtmlElementFlag.Closed);
else
    HtmlNode.ElementsFlags["div"] = HtmlElementFlag.Closed;

But why? I can't leave this code commented out, because if the HTML contains a badly closed div I will have problems later on, so every div has to end up closed.

NOTE: the bare <div /> flagged with a comment in the output should just be <div>, as in the HTML I receive.

1 answer

Solved! I found that you can set an element's entry in HtmlNode.ElementsFlags to Closed and CanOverlap at the same time, like this:

if (!HtmlNode.ElementsFlags.ContainsKey("div"))
    HtmlNode.ElementsFlags.Add("div", HtmlElementFlag.CanOverlap & HtmlElementFlag.Closed);
else
    HtmlNode.ElementsFlags["div"] = HtmlElementFlag.CanOverlap & HtmlElementFlag.Closed;
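
A caveat on the operator used above, shown as a minimal sketch: in C#, `&` is a bitwise AND, so applied to two different single-bit flags it yields 0 rather than a value that carries both. The `ElementFlag` enum and the `FlagDemo` class below are hypothetical stand-ins so the example runs without HtmlAgilityPack, and the numeric values are only assumed to mirror the library's `HtmlElementFlag`; any [Flags] enum with one bit per member behaves the same way.

```csharp
using System;

// Hypothetical stand-in for HtmlAgilityPack's HtmlElementFlag so the sketch
// runs without the library. The numeric values are an assumption.
[Flags]
public enum ElementFlag
{
    CData = 1,
    Empty = 2,
    Closed = 4,
    CanOverlap = 8
}

public static class FlagDemo
{
    // Bitwise AND keeps only the bits that are set in both operands.
    public static ElementFlag Both(ElementFlag a, ElementFlag b) => a & b;

    // Bitwise OR keeps the bits that are set in either operand.
    public static ElementFlag Either(ElementFlag a, ElementFlag b) => a | b;

    public static void Main()
    {
        var and = Both(ElementFlag.CanOverlap, ElementFlag.Closed);
        var or = Either(ElementFlag.CanOverlap, ElementFlag.Closed);

        Console.WriteLine((int)and);                            // 0
        Console.WriteLine(and.HasFlag(ElementFlag.Closed));     // False
        Console.WriteLine(or.HasFlag(ElementFlag.Closed));      // True
        Console.WriteLine(or.HasFlag(ElementFlag.CanOverlap));  // True
    }
}
```

If the library's values are laid out this way, the assignment above stores 0 for div, that is, no flags at all, which would match the observation in the question that the output is correct once the Closed flag is removed. Setting both flags would take `HtmlElementFlag.CanOverlap | HtmlElementFlag.Closed`; whether that produces the desired output is untested here.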
