Regex to remove HTML Entity

Question

Asked 6 years, 4 months ago


I have an HTML excerpt from which I extract some information. However, one of these excerpts contains the code &#43; in the middle, which represents the + sign. I tried a few approaches, but none of them removed that entity.

Is there a way to match this kind of HTML code specifically?
Today it is this one, but I may run into the same problem with others, for example the - sign.

What I’ve already tried:

string texto = "<td>Para maiores informações consulte: &#43; informações</td>";
string novoTexto = Regex.Replace(texto, "[;\\/:*?\"<>|&']", string.Empty);

Why remove it? Wouldn't it be better to simply decode it? https://stackoverflow.com/q/19692654

– hkotsubo

2019/04/11 at 16:50
Hey, thanks hkotsubo, I had forgotten about that possibility! It worked for me. If you want to post it as an answer, I'll mark it as solved.

– aa_sp

2019/04/11 at 17:28
It took me a while, but I posted an answer :-)

– hkotsubo

2019/04/12 at 00:28
Thanks, and thank you for taking the time to respond :)

– aa_sp

2019/04/12 at 10:45

1 answer

by hkotsubo • **55,826** points · Answer 1 · 2019-04-12T00:27:31+00:00

An HTML entity is perfectly valid content in an HTML document, and there is usually no reason to remove it.

What you can do is decode it, using HttpUtility.HtmlDecode (in the System.Web namespace) or WebUtility.HtmlDecode (in the System.Net namespace):

using System;
using System.Net; // WebUtility
using System.Web; // HttpUtility

string texto = "<td>Para maiores informações consulte: &#43; informações</td>";
Console.WriteLine(HttpUtility.HtmlDecode(texto));
Console.WriteLine(WebUtility.HtmlDecode(texto));

Both produce the same result:

<td>Para maiores informações consulte: + informações</td>
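Decoding is not limited to &#43;. The same call also handles other numeric entities (decimal or hexadecimal) and named ones, which covers the - case raised in the question. A minimal sketch, assuming .NET with System.Net available; the class name EntityDecodeDemo and the sample strings are only illustrative:

```csharp
using System;
using System.Net;

public static class EntityDecodeDemo
{
    // Thin illustrative wrapper; WebUtility.HtmlDecode does the real work.
    public static string Decode(string html) => WebUtility.HtmlDecode(html);

    public static void Main()
    {
        Console.WriteLine(Decode("1 &#43; 2"));        // decimal entity for +  -> 1 + 2
        Console.WriteLine(Decode("1 &#x2B; 2"));       // hexadecimal entity    -> 1 + 2
        Console.WriteLine(Decode("3 &#45; 1"));        // decimal entity for -  -> 3 - 1
        Console.WriteLine(Decode("a &lt; b &amp; c")); // named entities        -> a < b & c
    }
}
```

Because the decoder already knows every entity form, there is no need to list the characters one by one.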

But if you really want to remove the HTML entities (rather than replace them with the equivalent characters), just use:

Console.WriteLine(Regex.Replace(texto, "&[^;]+;", string.Empty));

The regex has the character & at the beginning and ; at the end. Between them there is:

[^;]: the [^ creates a negated character class, that is, it matches any character that is not listed inside the brackets. So this part means "any character that is not ;"
the quantifier + means "one or more occurrences"

Therefore, the regex means: the character &, followed by one or more characters other than ;, followed by ;. With that, every HTML entity in the text is removed. The output is:

<td>Para maiores informações consulte:  informações</td>
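One caveat: &[^;]+; also matches any stretch of text that merely starts with & and ends at the next ;, even when it is not an entity (for instance, "Tom & Jerry; the end" would lose "& Jerry;"). A stricter sketch, which is a variation of mine rather than part of the answer above, only accepts the three shapes an entity can take: a name, a decimal number, or a hexadecimal number. The class and method names are hypothetical:

```csharp
using System;
using System.Text.RegularExpressions;

public static class EntityStripDemo
{
    // & followed by a name (amp), #digits (#43) or #x plus hex digits (#x2B), then ;
    private static readonly Regex Entity =
        new Regex(@"&(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);");

    public static string StripEntities(string html) =>
        Entity.Replace(html, string.Empty);

    public static void Main()
    {
        // The entity is removed, leaving the two surrounding spaces.
        Console.WriteLine(StripEntities("consulte: &#43; informações"));
        // A bare ampersand in running text is left untouched.
        Console.WriteLine(StripEntities("Tom & Jerry; the end"));
    }
}
```

Whether the extra strictness is worth it depends on the input; for text that never contains a bare &, the shorter pattern behaves the same.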

Finally, just to explain why your regex didn't work:

[;\\/:*?\"<>|&']: the brackets define a character class, which matches any one of the characters listed between them. Therefore, this regex means "the character ;, or the character / (written as \/, an escaped slash), or the character :, etc.". The important detail is that the whole expression matches only one character at a time (any of those listed).

Therefore, this regex only deletes those individual characters. In the case of the HTML entity, only the & and the ; are deleted, but the digits and the # are not.
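To see this concretely, the sketch below runs the pattern from the question on the same input. Besides leaving #43 behind, it also strips the <, > and / of the tags and the : in the text. The class and method names are only illustrative:

```csharp
using System;
using System.Text.RegularExpressions;

public static class OriginalAttemptDemo
{
    // The character class from the question: each match is a single character.
    public static string RemoveListedChars(string input) =>
        Regex.Replace(input, "[;\\/:*?\"<>|&']", string.Empty);

    public static void Main()
    {
        string texto = "<td>Para maiores informações consulte: &#43; informações</td>";
        // Prints: tdPara maiores informações consulte #43 informaçõestd
        Console.WriteLine(RemoveListedChars(texto));
    }
}
```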