Removing strings from html files

4

In my project I need to read the contents of an HTML file, as described in an earlier question of mine. I can read the file, but it contains a comment that I want to remove.

The catch is that the content of this comment changes every time, so how can I make C# remove every comment of this kind that appears?

The comment that appears is this one:

<!-- saved from url=(0103)https://sistema.registrocivil.org.br/buscas/certidoes2aViaGerarXmlBusca.cfm?pedido_certidao_id= -->

Is there any way to strip HTML comment elements (<!-- -->) together with everything inside them? Because the content always changes, I can't simply call Replace with a fixed string.

Could someone help me?

2 answers

4


You can remove comments using Regex, as follows:

    string semComentarios = Regex.Replace(stringHtml, @"<!--(.*?)-->", String.Empty);

See a working example in this fiddle.

  • That's right, this is the best solution for the problem.

  • As written, it doesn't capture multi-line comments. You have to pass RegexOptions.Singleline so that `.` also matches newlines and stringHtml is handled as if it were a single line.

  • @Marcusvinicius just one follow-up: would it be possible to use a regex to remove HTML tags such as <html> and </body> while leaving the XML content intact?

  • Although the comment is addressed to Marcus, I'd say yes. Just follow Marcus's model. The regex is simple, with no complications: you replace whatever the expression matches with another string, which in Marcus's case is string.Empty, that is, an empty string. From there, a little imagination will take you far. Happy coding.

  • @pnet I tried, but it strips all the tags, including the XML tags, and I need to keep those; otherwise my page-reading logic doesn't work. =( Could you help? Anything you know would be a great help! I didn't want to open another question, because I think it's the same topic as this one. =(

  • It's better to open another question; otherwise people will close this one, since the new topic is more specific, i.e. HTML and BODY tags. Trust me, open a dedicated one. Here it's better to open several questions than to keep amending one, because the downvotes will pile up until the question gets closed. Besides, this one already has an accepted answer, and many people don't look at questions that have already been answered.

  • Well, I created a new question. If you can help me, any solution would be welcome!
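To illustrate the multi-line point raised in the comments above, here is a minimal sketch of the same Regex.Replace call with RegexOptions.Singleline added. The class name CommentStripper and the method name RemoveComments are hypothetical, chosen only for this example; they are not part of the original answer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CommentStripper
{
    // Hypothetical helper: removes every <!-- ... --> block,
    // including comments that span several lines.
    public static string RemoveComments(string html)
    {
        // Singleline makes '.' match '\n' as well.
        // '.*?' is lazy, so each comment is matched on its own
        // instead of swallowing the text between two comments.
        return Regex.Replace(html, @"<!--.*?-->", string.Empty, RegexOptions.Singleline);
    }

    public static void Main()
    {
        string html = "<div>a<!-- first\nsecond line --></div><p>b<!-- other --></p>";
        Console.WriteLine(RemoveComments(html)); // prints: <div>a</div><p>b</p>
    }
}
```

Without RegexOptions.Singleline, the first comment in this sample would be left in place, because `.` stops at the line break.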


2

Do the following:

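    // Requires: using System;
    string html = @"<div> dfklçbndflçbndfblçdfnmblçdfmblçdfmblçdmfblçdfbmç<!-- saved from url=(0103)https://sistema.registrocivil.org.br/buscas/certidoes2aViaGerarXmlBusca.cfm?pedido_certidao_id=21716443 --></div>";

    Console.WriteLine(html);

    // Locate the start of the first comment.
    int indexIni = html.IndexOf("<!--", StringComparison.Ordinal);

    // Look for the closing marker only after the opening one,
    // so a stray "-->" earlier in the text is not picked up.
    int indexFim = indexIni == -1
        ? -1
        : html.IndexOf("-->", indexIni + 4, StringComparison.Ordinal);

    if (indexIni != -1 && indexFim != -1)
    {
        // Remove everything from "<!--" through "-->" (3 characters long).
        html = html.Remove(indexIni, indexFim + 3 - indexIni);
    }

    Console.WriteLine(html);

  • If you have any questions about applying this solution, I'm available.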

  • If he uses ASP.NET tags, comments are written as @* *@, and Luã's example would not cover them. I think regex would be more practical. Not that Luã's approach is wrong at all; it's just that with it I would have to define more variables, whereas with regex I only add what I want to remove to the expression. Just a tip.

  • No problem @pnet. Unfortunately my knowledge of regex is quite limited, so I didn't show such a solution. But it really would be a better one.

  • No, Luã, that's not what I meant. I only meant that if he has another kind of comment delimiter, he will always have to create a new variable. Your solution is fine; I just wanted that to be taken into account.

  • Don't worry, I didn't take your comment the wrong way; it really would be a better solution. Constructive criticism is always welcome. We're good.

  • @Luãgovindamendessouza your solution is great! It works well here in my scenario!
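The IndexOf approach in this answer removes only the first comment it finds. As a rough sketch of how it might be extended to remove every comment without regex, the loop below repeats the same search. The names IndexOfCommentStripper and RemoveComments are hypothetical, and leaving an unterminated comment untouched is an assumption made for this example, not something the answer specifies.

```csharp
using System;
using System.Text;

public static class IndexOfCommentStripper
{
    // Hypothetical helper: removes all <!-- ... --> blocks using IndexOf only.
    public static string RemoveComments(string html)
    {
        var result = new StringBuilder(html.Length);
        int pos = 0; // start of the text not yet copied

        while (true)
        {
            int start = html.IndexOf("<!--", pos, StringComparison.Ordinal);
            if (start == -1) break;

            // Search for the terminator only after the opening marker.
            int end = html.IndexOf("-->", start + 4, StringComparison.Ordinal);
            if (end == -1) break; // assumption: keep an unterminated comment as-is

            result.Append(html, pos, start - pos); // copy the text before the comment
            pos = end + 3;                         // skip past "-->"
        }

        result.Append(html, pos, html.Length - pos); // copy whatever remains
        return result.ToString();
    }

    public static void Main()
    {
        Console.WriteLine(RemoveComments("x<!-- a -->y<!-- b -->z")); // prints: xyz
    }
}
```

Supporting another delimiter pair, as discussed in the comments, would mean passing the opening and closing markers in as parameters instead of hard-coding them.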
