How to make a regular expression that finds a name and then looks for a character?


Viewed 6,439 times

1

I was analyzing a large piece of HTML that basically follows this format:

<span id="mensagem" class="topo">Classes e comandos</span>

The problem is that the attributes inside the span tag vary in number and position.

The goal is to extract the text "Classes e comandos".

For that, when the search finds the sequence "mensagem", it should look for the next ">" character and, once it finds it, take the run of characters that follows, up to (but not including) the next "<".

Thus:

           (found)--------------v(found)
<span id="mensagem" class="topo">Classes e comandos</span>
                                 ||||||||||||||||||x(reaches this one and stops)
                                   (takes these)

I just need to express this as a regular expression. I am using Notepad++; does anyone know how to write a regular expression for this problem?

  • Which programming language are you using?

  • I am using Notepad++, more specifically the Find dialog with the "Regular expression" option selected.

  • Try the following: <span[^>]+>(.*?)<\/span>

  • See it working here

  • That is very close, Wéllington. On regex101 it separates the text into Group 1, but in Notepad++ it...

  • Recommended reading: https://stackoverflow.com/a/1732454/4438007


2 answers

2

Answer
As the user Wéllington mentioned, follow these steps:

Go to Search -> Replace.
Set the Find what field to: (<.*?(?=mensagem).*?>)(.*?)(<.*?>)|(.*)
Set the Replace with field to: \2 or $2.
Set the search mode to: Regular expression.
Click the button: Replace All.

This replaces the entire text with only the content of the tags that contain the keyword mensagem.

You can test this regex here.
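
For anyone who wants to try the same replacement outside Notepad++, here is a minimal sketch in Python. It assumes Python's `re` module behaves like Notepad++'s engine for this pattern, which may not hold in every detail; the function name `keep_tagged_text` and the sample input are made up for illustration.

```python
import re

# Pattern copied from the steps above. Python's `re` is only a stand-in
# for Notepad++'s regex engine here (an assumption, not a guarantee).
PATTERN = re.compile(r'(<.*?(?=mensagem).*?>)(.*?)(<.*?>)|(.*)')

def keep_tagged_text(text: str) -> str:
    # Replace every match with group 2. When the (.*) alternative matched,
    # group 2 did not participate and expands to an empty string.
    return PATTERN.sub(r'\2', text)

# Hypothetical sample input; every tag starts at the beginning of its line.
sample = (
    '<p>ignored line</p>\n'
    '<span id="mensagem" class="topo">Classes e comandos</span>\n'
    '<span class="topo" id="mensagem">Texto 02</span>'
)

result = keep_tagged_text(sample)
# Lines that did not match the first alternative are left empty; drop them.
kept = [line for line in result.splitlines() if line]
print(kept)
```

Note that in this sketch a line with whitespace before the tag falls through to the `(.*)` alternative and is blanked entirely, which is why the sample lines are not indented.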

If this does not solve your problem, leave a comment saying what you expected and what went wrong, and I will try to help. I hope this was useful :D

Explanation of the regex
This regex has 4 capture groups. I will explain what each one does so that it is easier to understand.

(<.*?(?=mensagem).*?>)

Group 1 captures the opening tag, provided the word mensagem appears somewhere before the ">" character. For that I used a positive lookahead: whatever is between (?= and ) must match at that position for the match to continue, but it is not consumed.

(.*?)

Group 2 only takes part if group 1 captured something, since it belongs to the same alternative and does not come after an OR operator. It captures any characters except line breaks and stops as soon as the next part of the expression can match.

(<.*?>)

Group 3 captures the tag that follows group 2. Its "<" also acts as the delimiter that makes group 2 stop capturing.

|(.*)

Group 4 comes after the OR operator, which means that if the regex cannot match with the previous alternative, it tries this one instead. It simply uses "." to match any character other than a line break (\n), so anything that does not match your search is deleted when everything is replaced with the content of group 2.
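
To see which groups take part in each case, here is a small sketch in Python (again assuming Python's `re` is close enough to Notepad++ for this pattern; the helper name `groups_for` and the two sample lines are only illustrative).

```python
import re

# Same pattern as in the answer above.
PATTERN = re.compile(r'(<.*?(?=mensagem).*?>)(.*?)(<.*?>)|(.*)')

def groups_for(line: str):
    # Match at the start of the line and return all four groups;
    # groups that did not participate come back as None.
    return PATTERN.match(line).groups()

# A tag containing "mensagem": groups 1 to 3 participate, group 4 does not.
tagged = groups_for('<span id="mensagem" class="topo">Classes e comandos</span>')

# A line without "mensagem": only group 4 (the OR branch) participates.
other = groups_for('<div class="topo">')

print(tagged)
print(other)
```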

1

Follow these steps:

Go to Search -> Replace.
Set the Find what field to: <span[^>]+>(.*?)</span>
Set the Replace with field to: \1 or $1.
Set the search mode to: Regular expression.
Click the button: Replace All.

Keep in mind that this leaves only the matched text. For example:

<div>
   <span id="mensagem" class="topo">Texto 01</span>
   <span id="mensagem" class="topo">Texto 02</span>
   <span id="mensagem" class="topo">Texto 03</span>
   <span id="mensagem" class="topo">Texto 04</span>
   <span id="mensagem" class="topo">Texto 05</span>
</div>

What remains is:

Texto 01
Texto 02
Texto 03
Texto 04
Texto 05
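
As a rough check outside Notepad++, the pattern suggested in the comments, `<span[^>]+>(.*?)</span>`, can be tried in Python. This is only a sketch: it uses `re.findall` to collect group 1 rather than a replace, because a plain substitution in Python would leave the surrounding `<div>` lines and the indentation in place. The variable names are made up for the example.

```python
import re

# Pattern from the comments: a span with at least one character of
# attributes, then the text (group 1), then the closing tag.
SPAN_TEXT = re.compile(r'<span[^>]+>(.*?)</span>')

# Shortened version of the sample used in this answer.
html = '''<div>
   <span id="mensagem" class="topo">Texto 01</span>
   <span id="mensagem" class="topo">Texto 02</span>
   <span id="mensagem" class="topo">Texto 03</span>
</div>'''

# findall returns the content of group 1 for every match, in order.
texts = SPAN_TEXT.findall(html)
print('\n'.join(texts))
```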
  • As explained in the question, span can come with attributes in different positions, with different names and lengths. Examples: <span id="mensagem" dir="auto" class="auto-Stamp"> <span id="mensagem" dir="" class="Limited"> #FEARING WAIVER TOGETHER WITH THE ECONOMIC TEAM VCS ARE A FIASCO </span> <span dir="" class="change-Scope" id="mensagem" > already got </span> These are just 3 examples from an endless list of tags (which keep changing in size, names and positions). Unfortunately it doesn't work for this problem.

  • Look, do you just want to get the text inside the right tag, or something more specific? Because the code does work; watch the video.

  • I ran the test, and by replacing <span[^>]+> with just <span> all the tags become equal, i.e., the regular expression works perfectly. To be frank, I never really understood it until I saw this example. I created several span tags with various attributes, ran the replacement, and it matched all of them.
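
A quick way to check the point about attribute order is a sketch like the one below, in Python rather than Notepad++. The sample tags are hypothetical, loosely modeled on the ones quoted in the comments. One thing it also shows: because `[^>]+` needs at least one character, a bare `<span>` with no attributes is not matched.

```python
import re

# [^>]+ consumes whatever attributes appear, in any order, up to the
# first ">" that closes the opening tag.
SPAN_TEXT = re.compile(r'<span[^>]+>(.*?)</span>')

# Hypothetical tags with attributes in different positions.
html = (
    '<span id="mensagem" dir="" class="Limited">first</span>\n'
    '<span dir="" class="change-Scope" id="mensagem" >already got</span>\n'
    '<span>no attributes</span>'
)

matches = SPAN_TEXT.findall(html)
print(matches)
```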
