Regex to extract HTML information

Question

Asked 6 years, 11 months ago

I am trying to extract information from the HTML body of an email. However, when execution reaches the line that calls Match, the following error is thrown:

{"parsing "(?si:(Tipo de Informação[^\d]+(?<Tipo de Informação>[\d]+)|Tipo de Informação(?<Tipo de Informação>[\d]+)))" - Invalid group name: group names must start with an alphabetical character."}

I have run several tests and have not been able to identify the cause. If anyone has an idea, thank you.

string texto = @"<P CLASS=CS95E872D0><SPAN CLASS=CSE27513221><SPAN STYLE='FONT-SIZE:10.0PT'>&NBSP;</SPAN></SPAN><O:P></O:P></P>
<P CLASS='CS95E872D0'><SPAN CLASS='CSE27513221'><SPAN STYLE='FONT-SIZE:10.0PT'>TIPO DE INFORMAÇÃO: INFORMAÇÃO A SER RECUPERADA</SPAN></SPAN><O:P></O:P></P>
<P CLASS='CS95E872D0'><SPAN CLASS='CSE27513221'><SPAN STYLE='FONT-SIZE:10.0PT'>PERIODO: &NBSP;31/10/2013 A 31/10/2018</SPAN></SPAN><O:P></O:P></P>";

string pattern = @"(?si:({0}[^\d]+(?<Tipo de Informação>[\d]+)|{0}(?<Tipo de Informação>[\d]+)))";

pattern = string.Format(pattern, "Tipo de Informação");

Match match = new Regex(pattern).Match(texto);

I recommend reading "Why should Regex not be used to handle HTML?"

– Woss

2019/01/03 at 12:25
Is "Tipo de Informação" exactly what you are trying to capture, or is it just an example? Besides, you probably do not need a regex for this, and it is impossible to help without something concrete.

– Leandro Angelo

2019/01/03 at 14:37

1 answer

by Reiksiel • **1,471** points · Answer 1 · 2019-01-03T16:08:08+00:00

Although using Regex on very large text is not recommended (and HTML is usually very large), it can be done as long as you do not create overly complex expressions (always including fixed text in the regex helps a lot =).

According to the MSDN documentation, Regular Expression Grouping Constructs:

(?<name>subexpression)
or: (?'name'subexpression)
where name is a valid group name and subexpression is any valid regular expression pattern. name should not contain punctuation characters or start with a number.

A space also counts as punctuation! =/
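
To illustrate the rule above, here is a minimal sketch that checks whether a pattern compiles. The class and method names are made up for this example. It relies only on the Regex constructor throwing ArgumentException for an invalid pattern, and deliberately does not depend on the exact wording of the error message.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GroupNameCheck
{
    // Hypothetical helper: returns true when the pattern compiles and
    // false when the Regex constructor rejects it (it throws
    // ArgumentException, or a subclass of it, for invalid patterns).
    public static bool Compiles(string pattern)
    {
        try
        {
            new Regex(pattern);
            return true;
        }
        catch (ArgumentException)
        {
            return false;
        }
    }

    public static void Main()
    {
        // Group name containing spaces: rejected when the pattern is parsed.
        Console.WriteLine(Compiles(@"(?<Tipo de Informação>\d+)"));

        // Same group with the spaces removed: accepted.
        Console.WriteLine(Compiles(@"(?<TipoDeInformacao>\d+)"));
    }
}
```

Because the check happens when the pattern is parsed, the exception is raised by new Regex(...) before Match ever runs.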

I rewrote your regex so that it works:

string pattern = @"(?si:({0}[^\d]+(?<TipoDeInformacao>[\d]+)|{0}(?<TipoDeInformacao>[\d]+)))";
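
A minimal sketch of how the corrected pattern could be used to read the captured value by group name. The helper name ExtractDigits and the sample input are assumptions for illustration, not part of the original code. Regex.Escape is an addition here so that a label containing regex metacharacters is treated as literal text.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NamedGroupDemo
{
    // Corrected pattern from this answer; {0} is replaced with the label text.
    private const string Template =
        @"(?si:({0}[^\d]+(?<TipoDeInformacao>[\d]+)|{0}(?<TipoDeInformacao>[\d]+)))";

    // Hypothetical helper: returns the digits that follow the label,
    // or null when the pattern does not match.
    public static string ExtractDigits(string label, string input)
    {
        // Escape the label so its characters are matched literally.
        string pattern = string.Format(Template, Regex.Escape(label));
        Match match = new Regex(pattern).Match(input);
        return match.Success ? match.Groups["TipoDeInformacao"].Value : null;
    }

    public static void Main()
    {
        // Made-up input, only to show the named group being read.
        Console.WriteLine(ExtractDigits("Tipo de Informação", "Tipo de Informação: 12345"));
    }
}
```

Note that the group only captures digits, so on the made-up input above it holds 12345. If the value you want after the label is text rather than a number, [\d]+ inside the group would need to change accordingly.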

Good luck!

Tip: use a Regex extension for Visual Studio, because with it you can test patterns under various configurations and generate example code ^^.