REGEX extract everything from a group


2

I’m trying to extract some information from an HTML document, but the regex doesn’t work. I’m fairly sure it isn’t picking up whitespace, or something like that.

https://regex101.com/r/LPJO4Z/1

<g-card(.*?)>(.*?)<\/g-img>

If I remove the <\/g-img> at the end, it works, but then it doesn’t capture the whole group I want, which is <g-card>...</g-img>.

  • If it’s HTML, why not do it through the DOM?

  • 1

    I’m not using PHP, I’m doing it in Java

  • 1

    Have a read here: https://stackoverflow.com/q/457684/5988245 Cheers.

  • 1

    @Lipespry wow, I used Jsoup and it was even better. Combined with the help from the regex answer, it was perfect. Thank you.

  • 2

    Glad I could help! And in no way did I want to belittle the answer from Master @hkotsubo! The point is that parsing HTML with regex when there is a library for it is "shooting yourself in the foot", even when it works! But anyway, your question is about regex. ;D

  • 2

    @Lipespry I used to be more radical and thought we should never use regex to work with HTML (just hunt down some of my old comments around here). But lately I’ve become more flexible, and depending on the case it can be a valid solution. If it’s a specific excerpt (a few tags, few nesting levels, etc.), the input is controlled (no CDATA or other more complex structures), and you are willing to make some concessions and even accept some false positives, I think it’s OK. That said, for the specific case of this question, an HTML parser is perhaps more suitable... :-)

  • 2

    @hkotsubo I confess that I share your opinion. But regex is better suited to static cases: something that will run only once, where there is a fixed pattern and/or you don’t have to keep "tinkering with the regex" for each text... Personally, I enjoy "playing with regex". But in the end, the HTML parser (library) is safer.

  • 1

    Well, in the end I used both methods and got exactly the result I wanted. I had written a single regex for everything, but it didn’t work; there must be some size limit in bytes, since it was a whole HTML document.


1 answer

3


By default, the dot does not match line breaks, so .* only goes up to the next line break and cannot go any further (and since the g-img is not on the same line as the g-card, it finds nothing).

Many languages and tools have an option to change this behavior, usually called "DOTALL" or "single line" (a confusing name for what the option does, but anyway).

On regex101.com, just choose the "single line" option (click on the flags in the right corner, just after the regex); this makes the dot match line breaks as well. With this option enabled, the matches are found correctly.
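
Since the asker mentions Java, here is a minimal sketch of the same idea there, using `Pattern.DOTALL` and its inline form `(?s)`. The class name, the helper method and the sample HTML are assumptions made up for illustration; the real page markup is not shown in the question.

```java
import java.util.regex.Pattern;

public class DotAllDemo {
    // Hypothetical sample input: g-card and g-img sit on different lines.
    static final String HTML =
        "<g-card class=\"a\">\n" +
        "  <div>title</div>\n" +
        "  <g-img>pic</g-img>\n" +
        "</g-card>";

    // The regex from the question, written as a Java string literal.
    static final String REGEX = "<g-card(.*?)>(.*?)<\\/g-img>";

    // Returns true if the regex, compiled with the given flags, matches somewhere in the input.
    static boolean matches(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).find();
    }

    public static void main(String[] args) {
        // Default mode: the dot stops at line breaks, so nothing is found.
        System.out.println(matches(REGEX, 0, HTML));              // false
        // DOTALL: the dot also matches line breaks.
        System.out.println(matches(REGEX, Pattern.DOTALL, HTML)); // true
        // Inline flag (?s) has the same effect as Pattern.DOTALL.
        System.out.println(matches("(?s)" + REGEX, 0, HTML));     // true
    }
}
```

The inline `(?s)` form can be handy when only the regex string can be changed and not the call that compiles it.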

Another option is to replace the dot with [\s\S]:

<g-card([\s\S]*?)>([\s\S]*?)<\/g-img>

Basically, \s matches "spaces, tabs and line breaks" (the exact meaning varies from one language/engine/tool to another, but line breaks are always included) and \S is "anything that is not \s". In other words, [\s\S] is "everything that is whitespace plus everything that is not": basically a "turbocharged dot", because it matches any character, including line breaks (regardless of whether the "single line" option is enabled).
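
As a rough Java sketch of the [\s\S] variant: the pattern below is compiled without any flags and still crosses line breaks. The class name, the `cardBody` helper and the sample input are hypothetical, chosen only to illustrate the technique.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AnyCharClassDemo {
    // No DOTALL flag here: [\s\S] matches line breaks on its own.
    static final Pattern CARD =
        Pattern.compile("<g-card([\\s\\S]*?)>([\\s\\S]*?)<\\/g-img>");

    // Returns the second capture group (the text between the opening
    // g-card tag and the closing </g-img>), or null if there is no match.
    static String cardBody(String input) {
        Matcher m = CARD.matcher(input);
        return m.find() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        // Hypothetical input with a line break between the two tags.
        String html = "<g-card id=\"x\">\n<g-img>pic</g-img>";
        System.out.println(cardBody(html));
    }
}
```

Both quantifiers are lazy (`*?`), so each group stops at the first possible `>` or `</g-img>` instead of running to the last one in the document.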


Anyway, using regex to manipulate HTML is not always the best solution. Often an HTML parser is the better option.

  • 1

    [\s\S] alone was enough for a +1! Hahaha

  • 1

    @Lipespry The first time I saw this "trick" it didn’t seem to make sense, but once I understood it, I thought it was pretty cool :-)

  • 1

    Damn, you saved me. +1
