Validate URL with a regular expression


1

I have a validation class and need a method to validate URLs, but the filter_var function seems flawed when validating them.


An example with 3 URLs:

The URL is complete and it validates (TRUE):
#1 'http://www.youtube.com' | string(22) "..."

The URL is invalid, yet the function still validates it (TRUE):
#2 'tp://www.youtube.com' | string(20) "..."

The URL is rejected (FALSE):
#3 'youtube.com' | bool(false)
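
For context, those results look like var_dump() output of filter_var(); presumably they came from something along these lines (FILTER_VALIDATE_URL is assumed here, since it is the filter under discussion):

var_dump(filter_var('http://www.youtube.com', FILTER_VALIDATE_URL)); // string(22) "http://www.youtube.com"
var_dump(filter_var('tp://www.youtube.com', FILTER_VALIDATE_URL));   // string(20) "tp://www.youtube.com"
var_dump(filter_var('youtube.com', FILTER_VALIDATE_URL));            // bool(false)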


I do not know if the problem is with the HTTP|HTTPS protocol; I haven't exhausted all my tests yet.
The rules for validating URLs, from what I've seen, are huge, and I don't quite understand them all.

I thought I'd use preg_match before filter_var to find the protocol with this regex:
"/(http|https):\/\/(.*?)$/i".

My fear is that this will also fail.
Does anyone have a simple suggestion for this impasse, other than a complex regex?

3 answers

4


For the record, there is no such thing as THE regular expression to validate URLs, and the blame lies partly with the RFC(s). The same goes for any data that relies on one.

PHP's filtering functions do follow the relevant specifications, but they do not cover every case; for the others, to avoid false positives, you adjust how strict the validation is to suit your needs through the configuration flags, which give you the flexibility each case requires.

Just for future reference: by default, if the second argument is omitted, it only treats the data as an ordinary string.

In your case, since you did not show how you are calling it, I imagine you are doing this:

filter_var( 'http://www.youtube.com', FILTER_VALIDATE_URL );

The first URL validates because it contains the main elements of a URL: the scheme, the domain and the TLD.

The second one also validates because it too has the three basic components, even if one of them is wrong.

For the second URL to also return FALSE, you would need to combine the filter above with the FILTER_FLAG_SCHEME_REQUIRED flag.

The third URL is valid for the user and for the browser, but not for the RFC, because it lacks one of the basic components required by the specification.
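
As a side note, flags are passed in the third argument of filter_var() and can be combined with bitwise OR; which ones you need depends on how strict you want to be. A small illustration (the URL is made up for the example):

// FILTER_FLAG_PATH_REQUIRED and FILTER_FLAG_QUERY_REQUIRED tighten the check further;
// on PHP 7.3+ the scheme and host flags are already implied (and deprecated).
var_dump(filter_var(
    'http://www.youtube.com/watch?v=abc',
    FILTER_VALIDATE_URL,
    FILTER_FLAG_PATH_REQUIRED | FILTER_FLAG_QUERY_REQUIRED
)); // returns the URL string, since it has both a path and a query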

What you could do, as with everything that comes from the user, is sanitize the URL before you even validate it. A few things occur to me:

  • Check whether the scheme is broken, as in the second URL, and fix it, either by removing it or repairing it, when and if possible.
  • Add the http:// scheme prefix to the beginning of a URL that lacks it (or whose broken scheme has just been removed); after all, an FTP or HTTPS URL (or ed2k, magnet, torrent...) that does not carry its specific prefix will not receive special treatment anyway.

And always warn the user, with a hint in the GUI, that the expected format is http://domain.com. If they type it wrong, the system cannot fix it and the check fails, then they were warned and will just have to fill it in again.
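
A minimal sketch of that sanitizing idea, assuming only http, https and ftp matter; the helper name and the exact repair rules are assumptions of mine, not part of the answer:

// Hypothetical helper: repair or add the scheme before handing the URL to filter_var().
function sanitizeUrl($url)
{
    $url = trim($url);

    // Broken/unknown scheme such as "tp://": strip everything up to and including "://".
    if (preg_match('#^[a-z\d+.-]*://#i', $url) && !preg_match('#^(https?|ftp)://#i', $url)) {
        $url = preg_replace('#^[a-z\d+.-]*://#i', '', $url);
    }

    // No scheme at all (e.g. "youtube.com"): assume plain http.
    if (!preg_match('#^(https?|ftp)://#i', $url)) {
        $url = 'http://' . $url;
    }

    return $url;
}

var_dump(filter_var(sanitizeUrl('tp://www.youtube.com'), FILTER_VALIDATE_URL)); // string(22) "http://www.youtube.com"
var_dump(filter_var(sanitizeUrl('youtube.com'), FILTER_VALIDATE_URL));          // string(18) "http://youtube.com"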

  • Good class, this; I will give it a close reading and then comment from the PC. I am using the cited flag... I forgot to put it in the question. Thanks.

  • +1... I use JS for the input mask, but the validation happens on the server side, which sends the messages back... With so many possibilities, to sanitize the URL input wouldn't you have to know all the rules in order not to make a mistake?

  • 1

    As a programmer, I am quite pessimistic. I always think of the worst possible scenario, and at the end of those scenarios there is always a user (>.<). If time and small performance losses are not a problem for you, it is worth focusing on RFC 3696 Section 4.2 (if I'm not mistaken), and on the others it derives from if need be, and trying to cover as many cases as possible; if you (your program) cannot resolve it on its own, unfortunately return the error to the user, there is nothing else to do.

3

Your second example is a valid URL! URLs have the general format:

scheme://host/path/resource

http and https are just two examples of a scheme (sometimes called a "protocol"). Others would be ftp, file... Nothing stops someone from creating a tp scheme, which is why the validator accepted your second example.

If you want to restrict the scheme to http and https, I suggest simply testing it right after the filter_var:

strpos($url, "http:") === 0 || strpos($url, "https:") === 0

(Note: why don't I test only whether the prefix is http? Because that would accept URLs like httpabc://...)
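
Putting the two checks together, a hedged sketch of a small helper (the function name is just illustrative):

// Hypothetical helper combining filter_var() with the prefix test suggested above.
function isHttpUrl($url)
{
    if (filter_var($url, FILTER_VALIDATE_URL) === false) {
        return false; // not a URL at all by RFC standards
    }

    // Restrict the scheme to http/https only.
    return strpos($url, "http:") === 0 || strpos($url, "https:") === 0;
}

var_dump(isHttpUrl('http://www.youtube.com')); // bool(true)
var_dump(isHttpUrl('tp://www.youtube.com'));   // bool(false) — unknown scheme
var_dump(isHttpUrl('youtube.com'));            // bool(false) — no scheme at all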

  • I did not know this about schemes; I knew about ftp, but not that new ones could be created. I will read more about this, thanks.

  • I find it safer to compare with FALSE and not with zero, just in case.

  • @Brunoaugusto Sorry, I don't understand. I actually don't have much experience with PHP; I copied this "startsWith" implementation from that reply on SOen. As I understand it, when the search term is a prefix of the string this function returns zero, not false, or am I mistaken? According to the documentation, it only returns false if the substring is NOT found.

  • 1

    The thing is, PHP is full of guesswork, with automatic casts and such. With the string functions that can return FALSE, common sense is to compare, negatively, against FALSE (strpos( $s, 'string' ) !== FALSE), because if something influences the variable (here, $s) before strpos(), it may produce a false positive.
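
    For reference, a small demonstration of what the two comparisons actually test:

    var_dump(strpos('http://example.com', 'http:')); // int(0)      — needle found at position 0
    var_dump(strpos('example.com', 'http:'));        // bool(false) — needle not found at all

    // "=== 0" asks whether the string starts with the needle;
    // "!== false" only asks whether the needle appears anywhere in the string.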

3

I use this regex and am satisfied with the results:

preg_match("%^(?:(?:https?|ftp)://)(?:\S+(?::\S*)?@|\d{1,3}(?:\.\d{1,3}){3}|(?:(?:[a-z\d\x{00a1}-\x{ffff}]+-?)*[a-z\d\x{00a1}-\x{ffff}]+)(?:\.(?:[a-z\d\x{00a1}-\x{ffff}]+-?)*[a-z\d\x{00a1}-\x{ffff}]+)*(?:\.[a-z\x{00a1}-\x{ffff}]{2,6}))(?::\d+)?(?:[^\s]*)?$%iu", $url)

You can find the regex at https://gist.github.com/dperini/729294
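
If it helps, a sketch of how the pattern might sit inside a validation class like the one mentioned in the question; the class and method names here are hypothetical:

// Hypothetical wrapper around the pattern above.
class UrlValidator
{
    // Rule maintained by dperini at https://gist.github.com/dperini/729294
    const URL_PATTERN = '%^(?:(?:https?|ftp)://)(?:\S+(?::\S*)?@|\d{1,3}(?:\.\d{1,3}){3}|(?:(?:[a-z\d\x{00a1}-\x{ffff}]+-?)*[a-z\d\x{00a1}-\x{ffff}]+)(?:\.(?:[a-z\d\x{00a1}-\x{ffff}]+-?)*[a-z\d\x{00a1}-\x{ffff}]+)*(?:\.[a-z\x{00a1}-\x{ffff}]{2,6}))(?::\d+)?(?:[^\s]*)?$%iu';

    public function isValidUrl($url)
    {
        return preg_match(self::URL_PATTERN, $url) === 1;
    }
}

$validator = new UrlValidator();
var_dump($validator->isValidUrl('http://www.youtube.com')); // bool(true)
var_dump($validator->isValidUrl('tp://www.youtube.com'));   // bool(false)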

  • In the question I said it should not be complex... If this is simple, then I do not know what complicated is :) I am kind of a layman when it comes to regex.

  • Dear friend, I tried to help. Regex is not my strong suit either; that is why I trust this regex that dperini maintains and updates on GitHub. When I cannot solve something myself, I prefer tested code. Sorry for taking up space in your post. All the best.

  • The full explanation of the rule is at the address I posted, along with comments and suggestions. The rule is not mine.
