Although the classic answer to anyone trying to process HTML with regular expressions is well known, the next answer to that same question adds an interesting point.
For one-off cases where I need to extract or manipulate some data in an HTML text in a simple way, it is often much faster and more practical to write a regular expression that does the job than to use an HTML parser. I see no problem with using regex in this kind of situation.
With that clarified, here is the answer:
var testString = '<div class="c2029" style="font-size:45px"><p class="auto">Testing 123...</p></div>';
var result = testString.replace(/<div class="c.*?>(.*?)<\/div>/, '$1');
console.log(result);
The regular expression itself:
<div class="c.*?>(.*?)<\/div>
Explanation:
<div class="c.*?>
- A lazy quantifier (.*?) is used here to match the start of the pattern and stop at the first occurrence of the tag's closing >.
(.*?)<\/div>
- We use the lazy quantifier inside a capture group and end with the closing tag of the div.
- Finally, we use replace(), keeping capture group 1 via the $1 marker.
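To see why the lazy quantifier matters, here is a minimal sketch comparing it with a greedy version of the same pattern. The two-div `sample` string and the variable names are assumptions made up for illustration; they are not part of the original question.

```javascript
// Hypothetical input with two sibling divs on one line.
const sample = '<div class="c1"><p>A</p></div><div class="c2"><p>B</p></div>';

// Lazy: each .*? stops at the earliest point that lets the match succeed,
// so only the first div is unwrapped.
const lazy = sample.replace(/<div class="c.*?>(.*?)<\/div>/, '$1');

// Greedy: the first .* runs to the last ">" that still allows a match,
// leaving nothing for group 1, so the whole string is replaced by "".
const greedy = sample.replace(/<div class="c.*>(.*)<\/div>/, '$1');

console.log(lazy);   // <p>A</p><div class="c2"><p>B</p></div>
console.log(greedy); // (empty string)
```

With the lazy version only the first wrapper is removed and its content is kept, which is the behavior the answer relies on.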
Update
According to the OP, the desired answer was apparently a different one, since there are situations where the closing </div> does not appear (which was not specified in the question).
Solution 2:
<div class="c.*?>(((?!<\/div>)[\s\S])*)(<\/div>)?
This regular expression was adjusted to handle the new situation (a missing closing tag) and also the possibility of line breaks in the content.
Demonstration: regex101.com
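For reference, Solution 2 can be applied in JavaScript the same way as the first one. This is only a sketch: the helper name `unwrapDiv` and the `unclosed` sample (a div with a line break and no closing tag) are assumptions added for illustration.

```javascript
// Solution 2: group 1 is the inner content, group 3 the optional </div>.
const pattern = /<div class="c.*?>(((?!<\/div>)[\s\S])*)(<\/div>)?/;

// Hypothetical helper: drops the wrapping div and keeps its content.
function unwrapDiv(html) {
  return html.replace(pattern, '$1');
}

const closed = '<div class="c2029" style="font-size:45px"><p class="auto">Testing 123...</p></div>';
// Assumed example of the OP's second case: line break, no closing tag.
const unclosed = '<div class="c2029">\n<p class="auto">Testing 123...</p>';

const closedResult = unwrapDiv(closed);
const unclosedResult = unwrapDiv(unclosed);

console.log(closedResult);                 // <p class="auto">Testing 123...</p>
console.log(JSON.stringify(unclosedResult)); // "\n<p class=\"auto\">Testing 123...</p>"
```

Note that the opening-tag part still uses `.`, so it assumes the opening `<div ...>` itself sits on a single line.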
Explanation:
<div class="c.*?>
- This is the start of the pattern. It matches any text up to the > that closes the opening tag.
(((?!<\/div>)[\s\S])*)
- This is a slightly more complex trick. The pattern (?!<\/div>) is a negative lookahead, which checks that the current position is not followed by the pattern <\/div>. Then I match the next character, whitespace or not (which is what [\s\S] means), that is, any character that passes the assertion. It is necessary to check first and consume afterwards, because if it were the other way around ([\s\S](?!<\/div>)), the last character before the closing tag, which should be captured, would be left out (you can verify this by changing the regex101 demonstration). In the end, I repeat this pattern zero or more times and wrap it in a capture group, resulting in: (((?!<\/div>)[\s\S])*).
(<\/div>)?
- Finally, I capture the closing tag of the div, marking it as optional with the ? quantifier. That way, even if the closing tag is missing, there is no problem.
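The difference between checking first and consuming first can also be reproduced outside regex101. Below is a small sketch; the input `'abc</div>'` and the variable names are assumptions chosen only to keep the example short.

```javascript
// Hypothetical minimal input: three characters followed by the closing tag.
const text = 'abc</div>';

// Check first, then consume: the lookahead runs before each character.
const checkFirst = /((?!<\/div>)[\s\S])*/;

// Consume first, then check: the lookahead runs after each character,
// so the character right before </div> is rejected.
const consumeFirst = /([\s\S](?!<\/div>))*/;

const checkFirstMatch = text.match(checkFirst)[0];
const consumeFirstMatch = text.match(consumeFirst)[0];

console.log(checkFirstMatch);   // abc
console.log(consumeFirstMatch); // ab
```

The second pattern loses the `c`, which is the missing last character described in the explanation.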
Read Anderson Woss's comment at this link, which recommends using the DOM rather than regex for parsing HTML.
– danieltakeshi
@danieltakeshi so, I’m trying to remove just this div and keep all the internal content
– Julyano Felipe