Regex - Remove a wrapper div matched by the start of its class name, along with its last closing tag

I have an HTML string and need to remove the opening tag that starts with <div class="c (everything from its first occurrence up to the first closing >) along with the last </div>, keeping the content in between. The opening tag has to be matched this way because the div's class is generated dynamically and only its first character stays the same.

For example: <div class="c2029" style="font-size:45px"><p class="auto">Testing 123...</p></div> should be transformed into <p class="auto">Testing 123...</p>

I tried it this way, but it’s removing the whole string:

var testString = '<div class="c2029" style="font-size:45px"><p class="auto">Testing 123...</p></div>'
var result = testString.replace(/\<div\_c.*\>/, '');

Edited

If the string has a line break, the solution stops working:

var testString = `<div class="c892"><h3>Test title</h3>
Description after a line break.</div>`
var result = testString.replace(/<div class="c.*?>(.*?)<\/div>/, '$1');

console.log(result);

As Peter pointed out in his answer, all it took was replacing . with [\s\S], which gives:

var result = testString.replace(/<div class="c.*?>([\s\S]*?)<\/div>/, '$1');
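
A minimal sketch of the difference, using a hypothetical two-line sample string. The third variant assumes an ES2018+ engine, where the s (dotAll) flag is available as an alternative to [\s\S]; it is not part of the original solution.

```javascript
// Hypothetical sample input: a div whose content spans two lines.
const multiLine = '<div class="c892"><h3>Test title</h3>\nDescription after a line break.</div>';

// "." never matches the line break, so </div> is never reached and nothing is replaced.
const withDot = multiLine.replace(/<div class="c.*?>(.*?)<\/div>/, '$1');

// [\s\S] matches any character, line breaks included.
const withClass = multiLine.replace(/<div class="c.*?>([\s\S]*?)<\/div>/, '$1');

// Assumption: ES2018+ engine. The "s" flag makes "." match line breaks as well.
const withDotAll = multiLine.replace(/<div class="c.*?>(.*?)<\/div>/s, '$1');

console.log(withDot === multiLine);    // true (string left unchanged)
console.log(withClass);                // inner content, wrapper div removed
console.log(withClass === withDotAll); // true
```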
  • See Anderson Woss's comment at this link, which recommends using the DOM rather than regex to parse HTML.

  • @danieltakeshi I'm only trying to remove this one div and keep all of its inner content.

1 answer

Although we all know the classic answer given to people trying to process HTML with regular expressions, the next answer to that same question makes an interesting point.

For one-off cases where I need to extract or manipulate some data in HTML text in a simple way, it is often much faster and more practical to write a regular expression that does the job than to use an HTML parser. I see no problem with using regex in that kind of situation.

With that out of the way, the answer:

var testString = '<div class="c2029" style="font-size:45px"><p class="auto">Testing 123...</p></div>'
var result = testString.replace(/<div class="c.*?>(.*?)<\/div>/, '$1');

console.log(result);

The regular expression itself:

<div class="c.*?>(.*?)<\/div>

Explanation:

  • <div class="c.*?> - A lazy quantifier (.*?) is used here to match the start of the opening tag and stop at the first occurrence of the tag's closing >.
  • (.*?)<\/div> - The lazy quantifier is used again, this time inside a capture group, followed by the div's closing tag.
  • Finally, replace() keeps only capture group 1, referenced by the $1 marker.
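
To see why the lazy quantifier matters, here is a small sketch comparing it with a greedy one on the opening tag alone. The greedy pattern is an illustrative assumption, not the exact pattern from the question, but it reproduces the "removes the whole string" symptom.

```javascript
const sample = '<div class="c2029" style="font-size:45px"><p class="auto">Testing 123...</p></div>';

// Greedy: ".*" runs to the end of the line, then backtracks only to the LAST ">".
const greedy = sample.replace(/<div class="c.*>/, '');

// Lazy: ".*?" stops at the FIRST ">", so only the opening tag is removed.
const lazy = sample.replace(/<div class="c.*?>/, '');

console.log(JSON.stringify(greedy)); // ""
console.log(lazy);                   // <p class="auto">Testing 123...</p></div>
```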

Update

According to the OP, the desired answer was apparently a different one, since there are situations where the closing </div> does not appear (which was not specified in the question).

Solution 2:

<div class="c.*?>(((?!<\/div>)[\s\S])*)(<\/div>)?

This regular expression was adjusted to handle the new situation as well as possible line breaks.

Demonstration: regex101.com

Explanation:

  • <div class="c.*?> - This is the start of the specified pattern. It matches any text up to the tag's closing >.
  • (((?!<\/div>)[\s\S])*) - This is a slightly more complex trick. The pattern (?!<\/div>) is a negative lookahead, which checks that the current position is not followed by the pattern <\/div>. Then I match the next character, whether or not it is whitespace (given by [\s\S]), that is, any character that passes that assertion. The check has to come first and the match second because, the other way around ([\s\S](?!<\/div>)), the last character before <\/div> would be left out of the capture as well (you can see this by changing the regex101 demonstration). In the end, I repeat this pattern zero or more times and wrap the whole thing in a capture group, resulting in: (((?!<\/div>)[\s\S])*).
  • (<\/div>)? - Finally, I match the div's closing tag, marking it as optional with the ? quantifier. That way, even if the closing tag doesn't exist, there won't be any problem.
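
As a quick check of Solution 2, the sketch below runs the pattern on two hypothetical inputs, one with and one without the closing tag. The helper name stripWrapper is made up for this example.

```javascript
// Solution 2 pattern, copied from the answer above.
const pattern = /<div class="c.*?>(((?!<\/div>)[\s\S])*)(<\/div>)?/;

// stripWrapper is a hypothetical helper name used only in this sketch.
function stripWrapper(html) {
  // $1 is the outer capture group: everything up to the first </div> (or end of input).
  return html.replace(pattern, '$1');
}

// Hypothetical sample inputs, both containing a line break.
const closed = '<div class="c892"><h3>Title</h3>\nText after a line break.</div>';
const unclosed = '<div class="c892"><h3>Title</h3>\nText after a line break.';

console.log(stripWrapper(closed));
console.log(stripWrapper(closed) === stripWrapper(unclosed)); // true
```

Because the capture ends at the first </div> it meets, a div nested inside the wrapper would cut the capture short; the sketch assumes no nested divs.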
  • They say God cries when a programmer uses regex to parse HTML...

  • They say there is a hidden passage in the book of Proverbs: "The perverse man processes HTML with regex, but the upright considers using a parser. There is no wisdom, no intelligence, no advice for those who blaspheme the Lord by reading context-free languages with regular grammars".

  • @Peter could you tell me why it doesn't work when there's a "\n" inside that div?

  • That's because the . in a regex does not match line-break characters. If you want your regex to work across them, replace (.*?) with ([\s\S]*?). That matches any character at all, line breaks included.

  • https://jsfiddle.net/9w60cz0v/1/ seems not to work

  • @Julyanofelipe Your example does not work because there is no closing tag for the div in the string. Take another look: https://jsfiddle.net/9w60cz0v/2/

  • @Pedro yes, I understand, but that falls outside the scope of the question, where the closing div was present.

  • @Julyanofelipe I don't understand. If the situation you just described is outside the scope of the question, why did you unaccept my answer? After all, it answered the original question.

  • @Peter I'm sorry, I totally glitched, haha. Now I see you were right. I've been trying to solve this for so long that I no longer knew what I was reading or doing.

  • @Pedro my friend, I know this is an old question, but I've hit another obstacle. If you can help me, I'll edit the question. Thank you.

  • @Julyanofelipe I can help, yes. What's the new problem?

  • @Peter thank you, the solution to my other problem was already in your answer :) I've edited my question to include it.
