Regex - Remove between tag and class name start and last tag close

Question

Asked 7 years, 6 months ago

Viewed 359 times

I have an HTML string and need to strip the wrapper div: everything from the first occurrence of <div class="c up to that tag's closing >, plus the last closing </div>, so that only the inner content remains. The opening tag has to be matched this way because the div's class is generated dynamically and only its first character stays the same.

For example: <div class="c2029" style="font-size:45px"><p class="auto">Testando 123...</p></div> should be transformed into <p class="auto">Testando 123...</p>

I tried it this way, but it’s removing the whole string:

var testString = '<div class="c2029" style="font-size:45px"><p class="auto">Testing 123...</p></div>'
var result = testString.replace(/\<div\_c.*\>/, '');

Edited

If the string has a line break, the solution stops working:

var testString = `<div class="c892"><h3>Título teste</h3>
Descrição após quebra de linha.</div>`
var result = testString.replace(/<div class="c.*?>(.*?)<\/div>/, '$1');

console.log(result);

Jsfiddle

As Pedro pointed out in his answer, all it took was using [\s\S] in place of the dot, which gives the following:

var result = testString.replace(/<div class="c.*?>([\s\S]*?)<\/div>/, '$1');
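A minimal sketch of why the line break matters: by default `.` does not match line terminators, so `(.*?)` cannot reach `</div>` across the newline, while `[\s\S]` matches any character. The sample text and the names `multiline`, `dotPattern` and `anyCharPattern` are illustrative assumptions, not part of the original post.

```javascript
// Sample input with a line break inside the div (same shape as the example above).
const multiline = '<div class="c892"><h3>Title</h3>\nDescription after line break.</div>';

// `.` does not match "\n", so this pattern finds no match
// and replace() returns the input unchanged.
const dotPattern = /<div class="c.*?>(.*?)<\/div>/;
const unchanged = multiline.replace(dotPattern, '$1');

// [\s\S] matches any character, including line breaks.
const anyCharPattern = /<div class="c.*?>([\s\S]*?)<\/div>/;
const unwrapped = multiline.replace(anyCharPattern, '$1');

console.log(unchanged === multiline); // true: nothing was replaced
console.log(unwrapped);
```

On engines that support ES2018, the `s` (dotAll) flag is another way to let `.` match line breaks; `[\s\S]` has the advantage of working without it.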

See Anderson Woss's comment at this link, which recommends parsing HTML with the DOM rather than with regex.

– danieltakeshi

2018/02/01 at 14:00
@danieltakeshi so, I’m trying to remove just this div and keep all the internal content

– Julyano Felipe

2018/02/01 at 14:02

1 answer

by Pedro Corso • **563** points · Answer 1 · 2018-02-01T15:49:00+00:00

Although we all know the classic answer given to people who try to process HTML with regular expressions, the next answer to that same question adds an interesting point.

For one-off cases where I need to extract or manipulate some data in an HTML text in a simple way, it is often much faster and more practical to write a regular expression that does the job than to use an HTML parser. I see no problem with using regex in this kind of situation.

With that clarified, the answer:

var testString = '<div class="c2029" style="font-size:45px"><p class="auto">Testing 123...</p></div>'
var result = testString.replace(/<div class="c.*?>(.*?)<\/div>/, '$1');

console.log(result);

The regular expression itself:

<div class="c.*?>(.*?)<\/div>

Explanation:

<div class="c.*?> - A lazy quantifier (.*?) is used here to match the rest of the opening tag, stopping at the first occurrence of the tag closure >.
(.*?)<\/div> - The lazy quantifier is used again, this time inside a capture group, followed by the closing tag of the div.
Finally, replace() keeps only capture group 1, referenced by the marker $1.
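To illustrate why the lazy quantifier matters, the sketch below compares the pattern above with a greedy variant of it. The greedy variant is a hypothetical counter-example written for this comparison, not something taken from the question; the variable names are likewise illustrative.

```javascript
const html = '<div class="c2029" style="font-size:45px"><p class="auto">Testing 123...</p></div>';

// Greedy variant: the first `.*` runs as far as it can, so the "opening tag"
// part swallows everything up to the ">" of </p>, and the capture group
// ends up empty. The whole string is replaced by nothing.
const greedy = html.replace(/<div class="c.*>(.*)<\/div>/, '$1');

// Lazy version from the answer: `.*?` stops at the first ">",
// and `(.*?)` stops at the first </div>.
const lazy = html.replace(/<div class="c.*?>(.*?)<\/div>/, '$1');

console.log(JSON.stringify(greedy)); // ""
console.log(lazy);                   // <p class="auto">Testing 123...</p>
```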

Update

According to the OP, the desired answer was actually a different one, since there are situations where the closing </div> does not appear (which was not specified in the question).

Solution 2:

<div class="c.*?>(((?!<\/div>)[\s\S])*)(<\/div>)?

This regular expression was adjusted to handle the new situation as well as possible line breaks.

Demonstration: regex101.com

Explanation:

<div class="c.*?> - This is the start of the pattern. It matches any text up to the first tag closure >.
(((?!<\/div>)[\s\S])*) - This is a slightly more complex trick. The pattern (?!<\/div>) is a negative lookahead, which checks that the current position is not immediately followed by <\/div>. I then match the next character, whether or not it is whitespace ([\s\S] matches any character, line breaks included). It is necessary to check first and consume afterwards: if it were the other way around ([\s\S](?!<\/div>)), the last character before <\/div> would be left out of the capture as well (you can see this happen by changing the regex101 demonstration). Finally, I put this in a capture group and repeat it zero or more times, resulting in: (((?!<\/div>)[\s\S])*).
(<\/div>)? - Finally, I match the closing tag of the div, marking it as optional with the quantifier ?. That way, even if the closing tag doesn't exist, there is no problem.
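The explanation above can be tried out with a short sketch. It applies Solution 2 to an input with a closing </div> and to one without, then runs the reversed order ([\s\S] before the lookahead) to show the dropped character. The helper name `unwrap` and the sample strings are assumptions made for illustration.

```javascript
// Solution 2 as given above; `unwrap` is a hypothetical helper name.
const wrapperPattern = /<div class="c.*?>(((?!<\/div>)[\s\S])*)(<\/div>)?/;
const unwrap = (input) => input.replace(wrapperPattern, '$1');

// With a closing </div> and a line break inside the content.
const withClosing = unwrap('<div class="c892"><h3>Title</h3>\nText.</div>');

// Without any closing </div>: the optional group simply matches nothing.
const withoutClosing = unwrap('<div class="c892"><h3>Title</h3>\nText.');

// Reversed order: consume first, check afterwards. The "c" right before
// </div> fails the lookahead, so the match stops after "ab" and the
// closing tag is never reached.
const reversedPattern = /<div class="c.*?>(([\s\S](?!<\/div>))*)(<\/div>)?/;
const reversed = '<div class="c1">abc</div>'.replace(reversedPattern, '$1');

console.log(JSON.stringify(withClosing));    // "<h3>Title</h3>\nText."
console.log(JSON.stringify(withoutClosing)); // "<h3>Title</h3>\nText."
console.log(reversed);                       // abc</div>
```

Both inputs give the same result with Solution 2, while the reversed pattern leaves the closing tag in place.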