Match everything between tags, including line breaks

Asked 6 years, 10 months ago

Viewed 208 times

2

<div class="1" >quero pegar

    AQUI DENTRO PODE TER
    quebra de linha...
    paragrafo...
    espaços
    várias outras tags html...
    qualquer coisa...

</div>

How do I match this div with regex? Example: https://regexr.com/4c1n0

2

You might want to use an HTML parser. If the tag can contain anything, it can also contain other divs, so you need to know whether you are in the inner or the outer div to know where to stop. Although that is possible (with recursive regex, which not all languages support), it is much easier to use the parsers available in each language :-)

– hkotsubo

2019/04/10 at 16:55
no, I don’t need it in regex for something bigger, it’s just an example, what interests me is the idea.

– Rod

2019/04/10 at 16:56
and including div? I edited the question, let’s say there’s a pattern in that div with id=1, there’s only one.

– Rod

2019/04/10 at 16:58
Possible duplicate of Why Regex should not be used to handle HTML?

– danieltakeshi

2019/04/10 at 17:01
I edited the question to show what I'd like; it's something simple.

– Rod

2019/04/10 at 17:05
For example, if I use <div class="1" >.*</div>, I can match everything when it is all on one line, but as soon as there is a line break it no longer matches.

– Rod

2019/04/10 at 17:07
1

Why not just take the innerHTML of the div?

– Sam

2019/04/10 at 17:12
See the example in the question, please.

– Rod

2019/04/10 at 17:12
1

I've seen it. It's easier to get the innerHTML of the div. Which language are you using?

– Sam

2019/04/10 at 17:13
I'll do that later. I need to capture about 20 patterns on a page; once I have those 20, I'll extract only what's inside the ones I want.

– Rod

2019/04/10 at 17:14
I don't know yet. For now I just want to match this div; after that I can handle what's inside the tags separately.

– Rod

2019/04/10 at 17:16


2 answers

3

Ideally you should use an HTML parser, since it handles all the valid cases of HTML syntax, which are much harder to deal with using regex. That said, let's look at some alternatives...

By default the dot does not match line breaks, so <div class="1">.*</div> does not work. An alternative is this regex:

<div[^>]*>([\s\S]*?)<\/div>

Right after <div we have [^>]*: a sequence of zero or more characters that are not >. Although the dot (.*) would also work, it is better to say exactly what you want, since the dot can run past the > and test the characters that follow (the regex may then backtrack, going back and forth several times until it finds a match). With [^>], the match is guaranteed to stop at the first >.

Then we have [\s\S]. The shortcut \s (lowercase) means "spaces, tabs, line breaks, etc." (the exact list varies by language), and \S (uppercase) means "anything that is not \s". So [\s\S] is "anything that is \s or is not \s", which is another way of saying "any character, including line breaks" (basically a "turbocharged dot").
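To make the difference between the dot and [\s\S] concrete, here is a minimal JavaScript sketch. The sample string and variable names are made up for the illustration; the second pattern is the regex shown above.

```javascript
// Hypothetical sample input: a div whose content spans two lines.
const html = '<div class="1">first line\nsecond line</div>';

// The dot does not match "\n", so this pattern cannot cross the line break.
const withDot = /<div[^>]*>(.*)<\/div>/.exec(html);

// [\s\S] matches any character, including line breaks.
const withAnyChar = /<div[^>]*>([\s\S]*?)<\/div>/.exec(html);

console.log(withDot);                        // null
console.log(JSON.stringify(withAnyChar[1])); // "first line\nsecond line"
```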

Next we have the quantifier * (zero or more occurrences). If you want to require the tag to have something inside it, you can replace it with +. The problem is that both are "greedy" and try to match as many characters as possible. So I put a ? right after the quantifier to cancel this behavior (with it, the regex takes as few characters as possible while still matching). The difference shows up when there is more than one div in the text. Example:

<div class="1" >div1</div>
<div class="1" >div2</div>

If I use <div[^>]*>([\s\S]*)<\/div> (without the ?), the regex matches both divs at once, since * is greedy and takes as many characters as possible. The result is a single match that spans the two divs together, see here.

With the ? (that is, <div[^>]*>([\s\S]*?)<\/div>), the * stops being greedy and takes as few characters as possible. The regex then matches the two divs separately, see here.
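The greedy and lazy versions can be compared side by side with a short JavaScript sketch. It uses the two-div sample above; the variable names are only illustrative, and String.prototype.matchAll assumes a reasonably recent runtime.

```javascript
// The two-div sample from the text, joined by a line break.
const html = '<div class="1" >div1</div>\n<div class="1" >div2</div>';

const greedy = /<div[^>]*>([\s\S]*)<\/div>/g; // without the ?
const lazy = /<div[^>]*>([\s\S]*?)<\/div>/g;  // with the ?

// matchAll yields one entry per match; index 1 is the capture group.
const greedyContents = [...html.matchAll(greedy)].map((m) => m[1]);
const lazyContents = [...html.matchAll(lazy)].map((m) => m[1]);

console.log(greedyContents.length); // 1 (a single match spanning both divs)
console.log(lazyContents);          // [ 'div1', 'div2' ]
```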

If you do not want to use capture groups, you can switch to this regex:

(?<=<div[^>]*>)[\s\S]*?(?=<\/div>)

In this case we use lookbehind and lookahead (the parts with (?<= and (?=). The difference is that they only check whether something exists before or after, and those parts are not included in the match (see here). However, lookbehind is somewhat less efficient, and not all engines accept variable-length expressions inside a lookbehind (here, the [^>]*), so it is probably best to use the first solution, with the capture group.
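As a rough illustration, the lookaround version can be tried in JavaScript, assuming an engine that implements ES2018 lookbehind (which allows variable-length expressions), such as a recent Node.js or Chrome. The sample string and variable names are made up for the example.

```javascript
// Two sibling divs, as in the earlier example.
const html = '<div class="1" >div1</div>\n<div class="1" >div2</div>';

// The opening and closing tags are only checked, not consumed,
// so each match is the content itself and no capture group is needed.
const lookaround = /(?<=<div[^>]*>)[\s\S]*?(?=<\/div>)/g;

const contents = html.match(lookaround);
console.log(contents); // [ 'div1', 'div2' ]
```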

But there is still a problem. If you have a div inside another:

<div class="1">
abc
<div>div interna</div>
xyz
</div>

The regex above stops at the first </div>, leaving "xyz" out (see here). And if I remove the ? from the regex, I am back to the previous problem of matching several divs at once (see here).
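A quick JavaScript check of the nested case, using the sample above (the variable names are only illustrative):

```javascript
// The nested sample from the text: an inner div inside the outer one.
const html = '<div class="1">\nabc\n<div>div interna</div>\nxyz\n</div>';

const lazy = /<div[^>]*>([\s\S]*?)<\/div>/;
const match = lazy.exec(html);

// The capture ends at the first </div>, which belongs to the inner div.
console.log(JSON.stringify(match[1])); // "\nabc\n<div>div interna"
console.log(match[1].includes("xyz")); // false
```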

Then it starts to get complicated, and in this case it may be easier to use an HTML parser, because it already handles these cases for you. A regex for this is not impossible, but it is much more complicated, since it has to be recursive:

<div[^>]*>((?:(?R)|(?:(?!<\/?div)[\s\S]))*)<\/div>

The problem is that not all languages and engines support recursive regex (the part with (?R), which calls the regex itself recursively). Basically, the regex checks whether what is inside is another div or something that is not a div tag, ensuring that you only take what's inside the outermost div (see this regex working here).
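JavaScript is one of the languages without (?R). As a rough substitute (a different technique, not the recursive regex itself), one can count opening and closing div tags by hand. The sketch below is only an illustration: outerDivContent is a hypothetical helper name, and it has no notion of comments, attributes containing >, or other HTML subtleties that a real parser handles.

```javascript
// Hypothetical helper: returns the inner content of the first outermost <div>,
// tracking nesting depth instead of relying on regex recursion.
function outerDivContent(html) {
  const tag = /<div\b[^>]*>|<\/div>/g; // an opening or a closing div tag
  let depth = 0;
  let contentStart = -1;
  for (const m of html.matchAll(tag)) {
    if (m[0] !== "</div>") {
      // Opening tag: at depth 0 this is where the outermost content begins.
      if (depth === 0) contentStart = m.index + m[0].length;
      depth += 1;
    } else if (depth > 0) {
      depth -= 1;
      // Back at depth 0: the outermost div has just been closed.
      if (depth === 0) return html.slice(contentStart, m.index);
    }
  }
  return null; // no complete div found
}

const html = '<div class="1">\nabc\n<div>div interna</div>\nxyz\n</div>';
console.log(JSON.stringify(outerDivContent(html)));
// "\nabc\n<div>div interna</div>\nxyz\n"
```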

Again, evaluate your use cases and see whether these expressions are worth using (and whether the simplest one already covers your cases). In some situations regex is acceptable, but depending on how complex your HTML is, a parser is the best option.

The recursive regex above, for example, still fails on this case:

<div class="1">abc
<!-- 
comentario </div>
-->
xyz
</div>

The regex cannot tell that this </div> is inside a comment and should be ignored (and since you said the div can contain "anything", I decided to include this example, even if it is rare). As a result, the "xyz" part is left out (see here).

An HTML parser, on the other hand, can ignore comments without much trouble.

1

Dude, I'm going to study this a lot. I still have a long way to go before I can do what I want. Thank you.

– Rod

2019/04/10 at 18:34
Can you recommend a good book about regex?

– Rod

2019/04/10 at 18:42
@Rod For studying, two sites I quite like are this one and this one. As for books, I recommend this one (which is quite dense and goes deep into the subject) and this one.

– hkotsubo

2019/04/10 at 18:46


by Marciano Machado • **2,154** points · Answer 1 · 2019-04-10T17:20:40+00:00

3

I used the following regex:

/\<div.*\>((.*|\s|\r|\n)*)\<\/div\>/gm

See the example below:

const texto = `<div class="1" >quero pegar

    AQUI DENTRO PODE TER
    quebra de linha...
    paragrafo...
    espaços
    várias outras tags html...
    qualquer coisa...
	<p> Teste </p>

	<p><span> Teste <span> </p>
</div>`;

const regex = /\<div.*\>((.*|\s|\r|\n)*)\<\/div\>/gm;
 

console.log(texto.replace(regex,  '$1'));

See the regex: https://regex101.com/r/KCm86L/1

It's matching everything, tags included: https://regexr.com/4c1n0

– Rod

2019/04/10 at 17:22
Replace with the $1 group.

– Marciano Machado

2019/04/10 at 17:24
Perfect. Regexr, which I'm using, handles it differently, but that's exactly what I wanted. Thanks, friend.

– Rod

2019/04/10 at 17:26
Just one last question: in a language like PHP or ASP, how do I retrieve that $1? Do you have any links?

– Rod

2019/04/10 at 17:27
https://regex101.com/r/KCm86L/1/ Click on Code Generator; you can choose the language there.

– Marciano Machado

2019/04/10 at 17:28
If you add another div, it matches everything, even what is outside the div: https://regex101.com/r/KCm86L/2

– Rod

2019/04/10 at 17:28
@Rod I posted an answer with an alternative for when you have more than one div :-)

– hkotsubo

2019/04/10 at 18:33
