Extract the URL from a link with a regular expression

5

I am trying to extract only the URL from a link, and only when the tag contains the marker [monitory].

The expression I use is this:

(?=<a.*\[monitory\].*href=["|'][http:|https:]?[\/\/]).*?["|']>

For example, given a link like this:

<a [monitory] href="http://www.google.com">Google</a>

I want to extract only the address:

http://www.google.com

2 answers

6


First, let's look at some details of your regex (and some suggestions to improve it).

Using .* is always tempting but "dangerous", since it means "zero or more occurrences of any character". In addition, the quantifier is greedy: it tries to match as many characters as possible.

This means that if your string has two links on the same line, the first one will be ignored. For example, if the string is:

<a [monitory] href="http://www.link1.com"><a [monitory] href="http://www.link2.com">

Only the address http://www.link2.com is captured, because the .* takes as many characters as possible (including the entire first link, link1.com).

To make the quantifier lazy (non-greedy), put a ? right after the *:

<a.*?\[monitory\].*?href=["|']([http:|https:]?[\/\/]?.*?)["|']>

This way the .*? takes as few characters as necessary, so both link1 and link2 are captured by the regex.
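
The difference between the greedy and lazy versions can be checked with a short script. This is only a sketch in Python (the question does not say which language is being used); the greedy pattern is the one from the asker's own answer, and the variable names are made up for illustration.

```python
import re

# Two links on the same line, as in the example above.
html = '<a [monitory] href="http://www.link1.com"><a [monitory] href="http://www.link2.com">'

# Greedy: each .* runs to the end of the line and backtracks,
# so the match latches onto the LAST [monitory] and the LAST href.
greedy = re.compile(r'''<a.*\[monitory\].*href=["|']([http:|https:]?[\/\/]?.*?)["|']>''')

# Lazy: each .*? stops at the first position where the rest can match.
lazy = re.compile(r'''<a.*?\[monitory\].*?href=["|']([http:|https:]?[\/\/]?.*?)["|']>''')

greedy_urls = greedy.findall(html)
lazy_urls = lazy.findall(html)

print(greedy_urls)  # ['http://www.link2.com']
print(lazy_urls)    # ['http://www.link1.com', 'http://www.link2.com']
```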


Another detail is that ["|'] is a character class, that is, it accepts any one of the characters inside the brackets. So this expression means the character " or the character | or the character '. That means the string could have | instead of quotation marks:

<a [monitory] href=|http://www.teste.com|>

And the regex would still accept it.

If you want to allow only " or ', remove the | from inside the brackets: ["'].
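
A quick way to see this, sketched in Python: the two patterns below differ only in the character class around the URL. The capture group is simplified to (.*?) so that only the quoting is being tested; the variable names are illustrative.

```python
import re

# Link "quoted" with | instead of " or '.
link = '<a [monitory] href=|http://www.teste.com|>'

# ["|'] accepts ", | or ' as the delimiter.
with_pipe = re.compile(r'''<a.*?\[monitory\].*?href=["|'](.*?)["|']>''')

# ["'] accepts only " or '.
quotes_only = re.compile(r'''<a.*?\[monitory\].*?href=["'](.*?)["']>''')

pipe_match = with_pipe.search(link)
quote_match = quotes_only.search(link)

print(pipe_match.group(1))  # http://www.teste.com
print(quote_match)          # None
```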

Similarly, [\/\/] means the character / or the character / (that is, it is redundant to repeat the same character inside the brackets, and in some languages this even raises an error). As a result, the regex accepts a URL with only one slash (http:/www.teste.com).

If you want two occurrences of /, remove the brackets.
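
To isolate the effect of the brackets, here is a small Python sketch that compares the character class with the literal sequence. The two short patterns are cut down from the regex for illustration and are not meant to be used on their own.

```python
import re

# Character class: matches exactly ONE "/" (repeating it changes nothing).
slash_class = re.compile(r'https?:[\/\/]www')

# No brackets: matches the literal sequence "//".
slash_literal = re.compile(r'https?:\/\/www')

class_one = bool(slash_class.search('http:/www.teste.com'))      # True
class_two = bool(slash_class.search('http://www.teste.com'))     # False
literal_one = bool(slash_literal.search('http:/www.teste.com'))  # False
literal_two = bool(slash_literal.search('http://www.teste.com')) # True

print(class_one, class_two, literal_one, literal_two)
```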

The brackets should also be removed from [http:|https:]?, for the reasons already explained: as a character class, it matches just one of the characters h, t, p, :, | or s. In fact the regex only works because both this part and [\/\/] are followed by a ?, which makes them optional, and after them comes a .*?, which matches any characters. To understand it better, put parentheses around each of them and see what each one captures.

To accept http or https, just use https? (the s? part makes the letter s optional). The regex would then be:
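
The experiment with parentheses could look like the Python sketch below. The pattern keeps only the href part of the regex and wraps each of the three pieces in its own group.

```python
import re

# One capture group per piece: the "protocol" class, the "slashes" class,
# and the lazy .*? that ends up doing almost all of the work.
pattern = re.compile(r'''href=["']([http:|https:]?)([\/\/]?)(.*?)["']>''')

m = pattern.search('<a [monitory] href="http://www.google.com">Google</a>')
parts = m.groups()

# The first class consumes a single "h", the second matches nothing
# (the next character is "t"), and .*? takes the rest of the URL.
print(parts)  # ('h', '', 'ttp://www.google.com')
```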

<a.*?\[monitory\].*?href=["'](https?:\/\/.*?)["']>
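
As a rough check, the corrected regex can be run against the examples used so far. This is a Python sketch; the helper name extract_monitored_urls is made up for illustration.

```python
import re

URL_IN_LINK = re.compile(r'''<a.*?\[monitory\].*?href=["'](https?:\/\/.*?)["']>''')

def extract_monitored_urls(html):
    """Return every URL captured by the corrected regex."""
    return URL_IN_LINK.findall(html)

single = extract_monitored_urls('<a [monitory] href="http://www.google.com">Google</a>')
two_links = extract_monitored_urls(
    '<a [monitory] href="http://www.link1.com"><a [monitory] href="http://www.link2.com">'
)
piped = extract_monitored_urls('<a [monitory] href=|http://www.teste.com|>')
one_slash = extract_monitored_urls('<a [monitory] href="http:/www.teste.com">')

print(single)     # ['http://www.google.com']
print(two_links)  # ['http://www.link1.com', 'http://www.link2.com']
print(piped)      # []
print(one_slash)  # []
```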


Also, this regex only works if [monitory] comes before href, and if there is no space right after the quotes that close the href. You can improve it a bit more by replacing the .*? between the attributes with \s+ (one or more whitespace characters) and, at the end, before closing the tag, adding \s* (since there may be zero or more spaces before the >):

<a\s+\[monitory\]\s+href=["'](https?:\/\/.*?)["']\s*>
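
The sketch below (Python, with illustrative variable names) compares the previous regex with this stricter one on a tag that has a space before the >, and shows that a tag with href before [monitory] is still not matched.

```python
import re

loose = re.compile(r'''<a.*?\[monitory\].*?href=["'](https?:\/\/.*?)["']>''')
strict = re.compile(r'''<a\s+\[monitory\]\s+href=["'](https?:\/\/.*?)["']\s*>''')

# Space between the closing quote and ">".
spaced = '<a [monitory] href="http://www.google.com" >Google</a>'
loose_spaced = loose.findall(spaced)
strict_spaced = strict.findall(spaced)

# href comes before [monitory]: neither regex was written for this order.
reordered = '<a href="http://www.google.com" [monitory]>Google</a>'
strict_reordered = strict.findall(reordered)

print(loose_spaced)      # []
print(strict_spaced)     # ['http://www.google.com']
print(strict_reordered)  # []
```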


Note that this is never-ending, since HTML tags are more complex than they appear. If you can guarantee that your strings always have this format, with no other variations, regex solves it. But if there are more cases (href before [monitory], other attributes, URLs with protocols such as ftp, gopher or mailto, or simply localhost, etc.), you will have to keep updating the regex.

Using .* also means that invalid URLs are accepted, such as http:///#@@#@#@ or even just http://. If you really want to validate any URL, you will end up with a monstrous expression, and at that point something so complicated is not worth using.

Regex, although it's very nice, is not the best tool for parsing and manipulating HTML. Maybe it's time to try more suitable tools.
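
A short Python sketch of the problem, using the stricter regex from above: both strings are captured as if they were valid addresses.

```python
import re

strict = re.compile(r'''<a\s+\[monitory\]\s+href=["'](https?:\/\/.*?)["']\s*>''')

# .*? after the two slashes matches anything, including nothing at all.
empty_host = strict.findall('<a [monitory] href="http://">')
garbage = strict.findall('<a [monitory] href="http:///#@@#@#@">')

print(empty_host)  # ['http://']
print(garbage)     # ['http:///#@@#@#@']
```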
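
As one possible example of a more suitable tool, here is a sketch using html.parser from the Python standard library. It assumes Python is an option, which the question does not say, and that the parser reports [monitory] as an attribute with no value. The class and function names are made up for illustration.

```python
from html.parser import HTMLParser

class MonitoredLinks(HTMLParser):
    """Collect the href of every <a> tag that has a [monitory] attribute."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        # attrs is a list of (name, value) pairs; a bare attribute has value None.
        attributes = dict(attrs)
        if '[monitory]' in attributes and attributes.get('href'):
            self.urls.append(attributes['href'])

def monitored_urls(html):
    parser = MonitoredLinks()
    parser.feed(html)
    parser.close()
    return parser.urls

page = (
    '<a [monitory] href="http://www.google.com">Google</a>'
    '<a class="x" href=\'https://example.com/page\' [monitory] >Site</a>'
    '<a href="http://other.example">Not monitored</a>'
)
found = monitored_urls(page)
print(found)  # ['http://www.google.com', 'https://example.com/page']
```

Attribute order, quote style and extra attributes no longer matter here, because the parser deals with the tag syntax. Checking whether each URL is actually valid is still a separate step.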


I understand that your regex worked, but the challenge with regex isn't just making it match the valid cases; it's also making it not match the invalid ones.

  • Very well explained. I will use this expression to monitor which links are accessed, by doing a redirect afterwards. Thank you.

  • Just to add: this is for email marketing.

2

Looking at another question asked on Stack Overflow, I understood one thing.

Parentheses ( ) are used to capture, so I changed my expression to:

<a.*\[monitory\].*href=["|']([http:|https:]?[\/\/]?.*?)["|']>

And now it works.

The other question is: Regular expression for picking text part
