Read HTML file

Asked 10 years ago

Viewed 2,559 times


In my project I need to read an HTML file whose source is structured like XML. I need to read this file, get the values of the XML tags it contains, and then run a whole process to save that data in my database...

Reading XML already works fine in my system, but I need it to be able to read an HTML file as well.

How can I do that? I have no idea where to start.

Structure of my HTML file

<html><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"></head><body><certidao>
<subtag></subtag>
<subtag></subtag>
<subtag></subtag>
<subtag></subtag>
<subtag></subtag>
<subtag></subtag>
<subtag></subtag>
<subtag></subtag>
<subtag></subtag>
<subtag></subtag>
<subtag></subtag>
<subtag></subtag>
<subtag></subtag>
<subtag></subtag>
<subtag></subtag>
<subtag></subtag>
<subtag></subtag>
</certidao>
</body></html>

I need to read everything inside the root tag certidao and disregard the HTML tags.

The HTML page is saved on the computer, so there is no need to access a link, only the file path.

Reading HTML or XML is analogous. Do you have XML reading code?

– Leonel Sanches da Silva

2015/07/14 at 19:35
To make things easier, is there a way I can remove the HTML tags and leave only the certidao tag? The thing is that the source of the HTML page is XML, that is, everything inside the certidao tag...

– Érik Thiago

2015/07/14 at 19:50
Can't the system open the file, edit it to remove the unwanted part, and then save the file as XML?

– Andrew Paes

2015/07/14 at 19:52
That is because the logic I have here, @Andrewpaes, reads XML. Only the requirement has changed: now I need to read an HTML file whose content is XML.

– Érik Thiago

2015/07/14 at 19:57
Wouldn't it be possible to use XDocument or XmlDocument to read the file? From there, you would only need to extract the contents of <certidao>.

– brazilianldsjaguar

2015/07/14 at 20:02
Unless XmlDocument needs the <?xml?> declaration at the top.

– brazilianldsjaguar

2015/07/14 at 20:02
I already use XDocument, @brazilianldsjaguar. The problem is that when it iterates over the elements, it goes straight through and does not read the tags.

– Érik Thiago

2015/07/14 at 20:03
I get it. Can you post this code? (And sorry for my Portuguese, it is not my native language!)

– brazilianldsjaguar

2015/07/14 at 20:05
I'm doing the same as in that other question of mine, only using Replace calls to strip the HTML tags and leave only the XML ones, that is, the ones inside the certidao tag.

– Érik Thiago

2015/07/14 at 20:06


1 answer


by Leandro Amorim • **1,865** points · Answer 1 · 2015-07-14T20:07:40+00:00

You can use HtmlAgilityPack.

PM> Install-Package HtmlAgilityPack

Here is a small code example:

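using System;
using HtmlAgilityPack;

HtmlDocument doc = new HtmlDocument();
doc.Load("arquivo.html"); // path to the HTML file saved on disk
// ".//subtag" is relative to the current <certidao>; "//subtag" would search the whole document
foreach (HtmlNode certidao in doc.DocumentNode.SelectNodes("//certidao"))
    foreach (HtmlNode subtag in certidao.SelectNodes(".//subtag"))
        Console.WriteLine(subtag.InnerText);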
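
One thing to watch for in the loop above: HtmlAgilityPack's SelectNodes returns null, not an empty collection, when nothing matches, so a file without a certidao element would throw a NullReferenceException. A minimal null-safe sketch follows. The class and method names (CertidaoHtmlReader, ReadSubtags) are hypothetical, not part of the library, and the method takes the HTML as a string so it is easy to try out.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

public static class CertidaoHtmlReader
{
    // Hypothetical helper (not part of HtmlAgilityPack): returns the inner
    // text of every <subtag> found under a <certidao> element.
    public static List<string> ReadSubtags(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var values = new List<string>();

        // SelectNodes returns null when no node matches the XPath expression.
        HtmlNodeCollection subtags = doc.DocumentNode.SelectNodes("//certidao//subtag");
        if (subtags == null)
            return values;

        foreach (HtmlNode subtag in subtags)
            values.Add(subtag.InnerText);

        return values;
    }
}
```

To read from disk instead, something like CertidaoHtmlReader.ReadSubtags(System.IO.File.ReadAllText("arquivo.html")) should work, assuming the file is UTF-8 as its meta tag declares.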

There is an example using your data, modified, on DotNetFiddle.
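
If you would rather keep your existing XDocument logic, as discussed in the comments, another option is to cut the certidao fragment out of the file text and parse only that part as XML. Loading the whole page as XML will most likely fail because the meta tag is never closed. This is only a sketch: CertidaoReader and its methods are hypothetical names, and it assumes the content between the opening and closing certidao tags is well-formed XML with the subtag elements as direct children.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Xml.Linq;

public static class CertidaoReader
{
    // Hypothetical helper: extracts the <certidao>...</certidao> fragment from
    // the HTML text and parses only that fragment as XML.
    public static XElement ExtractCertidao(string html)
    {
        const string openTag = "<certidao";
        const string closeTag = "</certidao>";

        int start = html.IndexOf(openTag, StringComparison.Ordinal);
        int end = html.IndexOf(closeTag, StringComparison.Ordinal);
        if (start < 0 || end < start)
            throw new InvalidDataException("No <certidao> element found.");

        // Keep everything from the opening tag through the closing tag.
        string fragment = html.Substring(start, end + closeTag.Length - start);
        return XElement.Parse(fragment);
    }

    // Returns the text value of each <subtag> that is a direct child of <certidao>.
    public static List<string> ReadSubtagValues(string html)
    {
        return ExtractCertidao(html)
            .Elements("subtag")
            .Select(e => e.Value)
            .ToList();
    }
}
```

With the file from the question this could be called as CertidaoReader.ReadSubtagValues(File.ReadAllText("arquivo.html")), and the XElement returned by ExtractCertidao can be handed to whatever XML-reading code you already have.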