Read HTML file

In my project I need to read an HTML file whose source code is structured like XML. I need to read this HTML file, get the values of the XML tags it contains, and then run a process that saves this data in my database.

My system already reads XML files without problems, but I need it to be able to read an HTML file as well.

How can I do that? I have no idea where to start.

Structure of my HTML file

<html><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"></head><body><certidao>
<subtag></subtag>
<subtag></subtag>
<subtag></subtag>
<subtag></subtag>
<subtag></subtag>
<subtag></subtag>
<subtag></subtag>
<subtag></subtag>
<subtag></subtag>
<subtag></subtag>
<subtag></subtag>
<subtag></subtag>
<subtag></subtag>
<subtag></subtag>
<subtag></subtag>
<subtag></subtag>
<subtag></subtag>
</certidao>
</body></html>

I need to read everything inside the root tag certidao and disregard the HTML tags.

The HTML page is saved on the computer, so the file is accessed by its file path rather than through a link.

  • Reading HTML or XML is analogous. Do you already have XML reading code?

  • To make it easier, is there a way to remove the HTML tags and leave only the certidao tag? The source code of the HTML page is really XML, that is, everything inside the certidao tag.

  • Can't the system open the file, remove the unwanted part, and then save the file as XML?

  • That's because the logic I have here, @Andrewpaes, reads an XML file. Only the requirement has changed: now I need to read an HTML file whose content is XML.

  • Wouldn't it be possible to use XDocument or XmlDocument to read the file? From there, you would only need to extract the contents of <certidao>.

  • Unless XmlDocument needs the <?xml?> declaration at the top.

  • I already use XDocument, @brazilianldsjaguar. But when I iterate over the elements, it goes straight through and does not read the tags.

  • I get it. Can you post this code? (And sorry for my Portuguese, it's not my native language!)

  • I'm doing the same here as in that other question of mine, just using Replace calls to strip the HTML tags and leave only the XML ones, that is, the ones inside certidao.

1 answer

You can use the HtmlAgilityPack.

PM> Install-Package HtmlAgilityPack

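Here is a small code example:

using System;
using HtmlAgilityPack;

HtmlDocument doc = new HtmlDocument();
doc.Load("arquivo.html");
// ".//subtag" is relative to the current <certidao> node; "//subtag" would search the whole document.
foreach (HtmlNode certidao in doc.DocumentNode.SelectNodes("//certidao"))
    foreach (HtmlNode subtag in certidao.SelectNodes(".//subtag"))
        Console.WriteLine(subtag.InnerText);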

There is an example with your data, modified, on DotNetFiddle.
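
The comments also suggest another route: drop the HTML wrapper and parse what is left as XML. Below is a minimal sketch of that idea, not a definitive implementation. It assumes the <certidao> fragment appears once in the file and is itself well-formed XML with lowercase tag names. The helper name ExtractCertidao is hypothetical (it does not come from any library), and the default file name is taken from the example above.

```csharp
using System;
using System.IO;
using System.Xml.Linq;

public class Program
{
    // Hypothetical helper: returns the "<certidao>...</certidao>" fragment
    // of the given HTML text, or null if either tag is missing.
    public static string ExtractCertidao(string html)
    {
        const string open = "<certidao";
        const string close = "</certidao>";

        int start = html.IndexOf(open, StringComparison.Ordinal);
        if (start < 0) return null;

        int end = html.IndexOf(close, start, StringComparison.Ordinal);
        if (end < 0) return null;

        // Include the closing tag itself in the returned fragment.
        return html.Substring(start, end - start + close.Length);
    }

    public static void Main(string[] args)
    {
        // Assumption: the path comes from the first argument, else "arquivo.html".
        string path = args.Length > 0 ? args[0] : "arquivo.html";
        string xml = ExtractCertidao(File.ReadAllText(path));
        if (xml == null)
        {
            Console.WriteLine("No <certidao> element found.");
            return;
        }

        // XElement.Parse throws XmlException if the fragment is not well-formed XML.
        XElement certidao = XElement.Parse(xml);
        foreach (XElement subtag in certidao.Elements("subtag"))
            Console.WriteLine(subtag.Value);
    }
}
```

This keeps the existing XML-reading logic usable, since the extracted fragment can be handed to XDocument or XElement as before. If the content inside <certidao> is not strictly well-formed, the HtmlAgilityPack approach is more tolerant.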
