In my web service I always need to format the HTML I receive. To make sure it is properly formatted, I use HtmlAgilityPack.
HTML I receive:
<p>
<div>
<b>text:</b>
<img alt="" height="362" src="/PublishingImages/imageName.png?RenditionID="16&Width=639&Height=362" width="639" style="BORDER: 0px solid; ">
</div>
<div> <!---open like this is correct-->
<b>text:</b>
<div style="text-align:justify;"></div>
<div style="text-align:justify;"></div>
<p style="text-align:justify;">
<span class="ms-rteThemeForeColor-2-0">
<br>
text
</span>
</p>
<p style="text-align:justify;">
<br class="ms-rteThemeForeColor-2-0">
<span class="ms-rteThemeForeColor-2-0">
text
<br>
</span>
</p>
<p style="text-align:justify;">
<span class="ms-rteThemeForeColor-2-0">
<br>
</span>
</p>
<p style="text-align:justify;">
<span class="ms-rteThemeForeColor-2-0">
text
</span>
<br>
</p>
</div>
</p>
My code to format HTML:
if (!HtmlNode.ElementsFlags.ContainsKey("p"))
    HtmlNode.ElementsFlags.Add("p", HtmlElementFlag.Closed);
else
    HtmlNode.ElementsFlags["p"] = HtmlElementFlag.Closed;

if (!HtmlNode.ElementsFlags.ContainsKey("span"))
    HtmlNode.ElementsFlags.Add("span", HtmlElementFlag.Closed);
else
    HtmlNode.ElementsFlags["span"] = HtmlElementFlag.Closed;

if (!HtmlNode.ElementsFlags.ContainsKey("div"))
    HtmlNode.ElementsFlags.Add("div", HtmlElementFlag.Closed);
else
    HtmlNode.ElementsFlags["div"] = HtmlElementFlag.Closed;

var htmlDoc = new HtmlDocument();
htmlDoc.OptionFixNestedTags = true;
htmlDoc.OptionWriteEmptyNodes = true;
htmlDoc.LoadHtml(myHtml);

foreach (var eachNode in htmlDoc.DocumentNode.SelectNodes("//*"))
{
    var count = 0;
    foreach (var attr in eachNode.Attributes)
        if (attr.Name.ToLower() != "href" && attr.Name.ToLower() != "src" && attr.Name.ToLower() != "alt" && attr.Name.ToLower() != "style")
        {
            attr.Name = "feeds" + count.ToString();
            attr.Value = "";
            count++;
        }
}

var htmlError = htmlDoc.ParseErrors.SafeAny();
if (!htmlError)
    myHtml = htmlDoc.DocumentNode.InnerHtml;
However, the HTML that HtmlAgilityPack produces differs slightly from the initial HTML.
HTML after being formatted by HtmlAgilityPack:
<p>
<div>
<b>text:</b>
<img alt="" feeds0="" src="/PublishingImages/imageName.png?RenditionID=" feeds1="" feeds2="" style="BORDER: 0px solid; " />
</div>
<div /> <!---should not be closed-->
<b>text:</b>
<div style="text-align:justify;" />
<div style="text-align:justify;"> <!---should not be open-->
<p style="text-align:justify;">
<span feeds0="">
<br />
text
</span>
</p>
<p style="text-align:justify;">
<br />
<span feeds0="">
text
<br />
</span>
</p>
<p style="text-align:justify;">
<span feeds0="">
<br />
</span>
</p>
<p style="text-align:justify;">
<span feeds0="">
text
</span>
<br />
</p>
</div>
</p>
Why does this happen and how can I fix it? I have already found that if I comment out the following code, the HTML is properly formatted:
if (!HtmlNode.ElementsFlags.ContainsKey("div"))
    HtmlNode.ElementsFlags.Add("div", HtmlElementFlag.Closed);
else
    HtmlNode.ElementsFlags["div"] = HtmlElementFlag.Closed;
But why? I can't simply leave this code out, because if the HTML contains a badly closed div I will have problems later on, so every div has to be closed.
NOTE: the self-closed <div /> in the output (the one marked with a comment) should just be <div>, as in the HTML I receive.
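To narrow the problem down, a minimal comparison along the following lines might help. This is only a sketch: the class name `ClosedFlagRepro`, the helper `RoundTrip`, and the reduced markup are hypothetical and not part of my service. It parses the same fragment once without any flag for `div` and once with `HtmlElementFlag.Closed`, using the same document options as above, so the two outputs can be compared side by side.

```csharp
using System;
using HtmlAgilityPack;

class ClosedFlagRepro
{
    // Parses the markup and returns what HtmlAgilityPack writes back,
    // using the same document options as the code in the question.
    static string RoundTrip(string html)
    {
        var doc = new HtmlDocument();
        doc.OptionFixNestedTags = true;
        doc.OptionWriteEmptyNodes = true;
        doc.LoadHtml(html);
        return doc.DocumentNode.OuterHtml;
    }

    static void Main()
    {
        // Hypothetical reduced input: an outer div that contains an empty div.
        const string html =
            "<div><b>text:</b><div style=\"text-align:justify;\"></div><p>text</p></div>";

        // ElementsFlags is static, so remember its current state for "div".
        bool hadFlag = HtmlNode.ElementsFlags.TryGetValue("div", out var original);

        // Case 1: no flag registered for div.
        HtmlNode.ElementsFlags.Remove("div");
        Console.WriteLine("Without flag: " + RoundTrip(html));

        // Case 2: div flagged as Closed, as in the question.
        HtmlNode.ElementsFlags["div"] = HtmlElementFlag.Closed;
        Console.WriteLine("With Closed:  " + RoundTrip(html));

        // Restore the previous state so other code in the process is unaffected.
        if (hadFlag)
            HtmlNode.ElementsFlags["div"] = original;
        else
            HtmlNode.ElementsFlags.Remove("div");
    }
}
```

Because `HtmlNode.ElementsFlags` is a static dictionary, any change to it affects every document parsed afterwards in the same process, which is why the sketch restores the original entry at the end.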