Regex to remove HTML Entity

I have an HTML excerpt from which I extract some information. However, one of these excerpts contains the code &#43; in the middle, which represents the + sign. I tried a few approaches, but none of them removed it.

Is there a way to match a specific HTML code?
Today it happens with this one, but I may run into the same problem with others, for example the - sign.

What I’ve already tried:

string texto = "<td>Para maiores informações consulte: &#43; informações</td>";
string novoTexto = Regex.Replace(texto, "[;\\/:*?\"<>|&']", string.Empty);
  • 3

    Why remove it? Wouldn’t it be better to simply decode it? https://stackoverflow.com/q/19692654

  • Hey, thanks hkotsubo, I had forgotten about this possibility! It worked for me here. If you want to post it as an answer, I’ll mark it as solved.

  • It took me a while, but I posted an answer :-)

  • Thanks, I appreciate you taking the time to answer :)

1 answer

2


An HTML entity is perfectly valid content in an HTML document, and there is no reason to remove it.

What you can do is decode it, using HttpUtility.HtmlDecode (available in the System.Web namespace) or WebUtility.HtmlDecode (available in the System.Net namespace):

string texto = "<td>Para maiores informações consulte: &#43; informações</td>";
Console.WriteLine(HttpUtility.HtmlDecode(texto));
Console.WriteLine(WebUtility.HtmlDecode(texto));

Both produce the same result:

<td>Para maiores informações consulte: + informações</td>
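The same call also handles other entities, such as the - sign mentioned in the question. A minimal sketch, assuming System.Net is available; the class name EntityDecodeDemo and the sample strings are illustrative only and not part of the original code:

```csharp
using System;
using System.Net;

public static class EntityDecodeDemo
{
    // Decodes decimal (&#43;), hexadecimal (&#x2D;) and named (&amp;) entities alike.
    public static string Decode(string html) => WebUtility.HtmlDecode(html);

    public static void Main()
    {
        Console.WriteLine(Decode("a &#43; b &#45; c"));  // a + b - c
        Console.WriteLine(Decode("x &amp; y &#x2D; z")); // x & y - z
    }
}
```

Since the decoder works on any entity, there is no need to list each code by hand.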

But if you want to remove the HTML entities (rather than replace them with the equivalent characters), just use:

Console.WriteLine(Regex.Replace(texto, "&[^;]+;", string.Empty));

The regex starts with the character & and ends with ;. Between them, there is:

  • [^;]: the [^ creates a negated character class, which matches any character that is not listed inside the brackets. Therefore, this part means "any character that is not ;"
  • the quantifier + means "one or more occurrences"

Therefore, the regex means: the character &, followed by one or more characters other than ;, followed by ;. With that, all the HTML entities are eliminated. The output is:

<td>Para maiores informações consulte:  informações</td>
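One caveat: &[^;]+; also matches a literal & followed later by a ;, even when the text in between is not an entity. If that matters, a stricter pattern may help. The following is only a sketch and not part of the original answer: the pattern and the names EntityStripper and Strip are assumptions for illustration, covering the decimal, hexadecimal and named entity forms:

```csharp
using System;
using System.Text.RegularExpressions;

public static class EntityStripper
{
    // & followed by #digits, or #x plus hex digits, or a name, and then ;
    private static readonly Regex Entity =
        new Regex(@"&(?:#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);");

    public static string Strip(string html) => Entity.Replace(html, string.Empty);

    public static void Main()
    {
        // The entities are removed; the lone "&" in "R & D;" is left alone.
        Console.WriteLine(Strip("1 &#43; 1 &lt; 3, R & D; done"));
        // The broader pattern would also consume "& D;".
        Console.WriteLine(Regex.Replace("R & D; done", "&[^;]+;", string.Empty));
    }
}
```

The stricter pattern does not check that a name is a real entity, only that it has the shape of one.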

Just to explain why your regex didn’t work.

[;\\/:*?\"<>|&']: the brackets define a character class, which matches any one of the characters between the brackets. Therefore, this regex means "the character ;, or the character / (written \/ in the pattern, where the backslash only escapes the slash), or the character :, etc...". The key detail is that the whole expression matches only one character (which may be any of those listed).

Therefore, this regex only deletes these characters. In the case of the HTML entity, only the & and the ; are deleted, but the digits and the # are not.
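To see this concretely, the pattern from the question can be run against short sample strings. A small sketch; the class name CharClassDemo and the method RemoveListedChars are illustrative names, not from the original code:

```csharp
using System;
using System.Text.RegularExpressions;

public static class CharClassDemo
{
    // Same pattern as in the question: removes each listed character individually.
    public static string RemoveListedChars(string s) =>
        Regex.Replace(s, "[;\\/:*?\"<>|&']", string.Empty);

    public static void Main()
    {
        // & and ; are removed one by one, but # and the digits remain.
        Console.WriteLine(RemoveListedChars("consulte &#43; info")); // consulte #43 info
        // <, > and / are listed in the class too, so the markup itself is damaged.
        Console.WriteLine(RemoveListedChars("<td>x</td>"));          // tdxtd
    }
}
```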

  • 1

    I like reading your answers about regex. If you write, or have already written, something on the subject, give me the name of the book and I’ll buy it.

  • 1

    @Augustovasques Thank you! I have written a book, but it is not about regex (see my profile; it is about a subject that doesn’t get as many questions as regex...). But thank you for the comment, it’s a sign that my studies are paying off :-)
