Regular expression to validate an internet domain

How can I create a regular expression to validate an internet domain? The rules are:

  1. Minimum length of 2 and maximum of 26 characters;
  2. Valid characters are the letters "a" to "z", the digits "0" to "9", and the hyphen;
  3. It must not contain only digits;
  4. It must not start or end with a hyphen.

Note that the validation applies only to the part between the beginning of the value and the first dot. Example: domain.com.br.
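For reference, the four rules can also be written as simple checks, which gives something to compare any candidate regex against. This is only a sketch: the function name `isValidFirstLabel` is hypothetical, accepting uppercase letters is an assumption (rule 2 only mentions "a" to "z"), and so is rejecting a value that has no dot at all.

```javascript
// Hypothetical helper: checks the part before the first dot against the four rules.
function isValidFirstLabel(value) {
  const dot = value.indexOf(".");
  if (dot === -1) return false; // assumption: a dot must follow the label
  const label = value.slice(0, dot);

  // Rule 1: between 2 and 26 characters.
  if (label.length < 2 || label.length > 26) return false;
  // Rule 2: only letters, digits and the hyphen (case-insensitive by assumption).
  if (!/^[a-z0-9-]+$/i.test(label)) return false;
  // Rule 3: must not consist only of digits.
  if (/^[0-9]+$/.test(label)) return false;
  // Rule 4: must not start or end with a hyphen.
  if (label.startsWith("-") || label.endsWith("-")) return false;

  return true;
}

console.log(isValidFirstLabel("domain.com.br")); // true
console.log(isValidFirstLabel("123.com.br"));    // false (digits only)
console.log(isValidFirstLabel("-abc.com.br"));   // false (leading hyphen)
```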

It will be used for both URLs and email addresses.

Here is the code I wrote:

([^-](([a-zA-Z0-9]?)*([a-zA-Z-])([a-zA-Z0-9]?))+[^-])\.

This is the closest I got to the answer; hyphens at the beginning and end still need to be rejected:

((([\w]?)+([a-zA-Z-_])([\w]?)+){2,26})\.
  • These rules may cause problems depending on where they are applied. They work well for a DNS entry, but for user-facing input they are wrong. Accented characters and other scripts are valid in user input. Conversion to punycode is the system's responsibility, not the user's. Also, a domain cannot have two dots in a row (and two hyphens in a row usually appear only at the start of a punycode label). Example of a link that would be blocked: http://www.estadão.com.br

  • You’re right, @Bob. Accents can (and should) be accepted, but I’m not so sure about consecutive dots or hyphens (programmatically speaking).

  • The closest I got to the answer, where hyphens at the beginning and end still need to be rejected: ((([\w]?)+([a-zA-Z-_])([\w]?)+){2,26})\.
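To illustrate the point in the comments that punycode conversion belongs to the system rather than the user, here is a minimal sketch that relies on the standard `URL` class (available in browsers and in Node.js) to obtain the ASCII form of a host. The helper name `asciiHostname` is hypothetical, and the sketch assumes a runtime whose `URL` implementation performs the IDNA conversion.

```javascript
// Hypothetical helper: returns the host of a URL in its ASCII (punycode) form.
function asciiHostname(urlString) {
  // The URL parser converts internationalized host labels to "xn--..." labels.
  return new URL(urlString).hostname;
}

// "\u00e3" is the precomposed letter "ã".
const host = asciiHostname("http://www.estad\u00e3o.com.br");
console.log(host); // the second label is printed in its "xn--" form
```

With this approach the user can type the accented name, and an ASCII-only rule set is applied to the converted label afterwards. Such a label contains two consecutive hyphens after "xn", which the four rules in the question do not forbid.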

1 answer

1

How about:

^(?!\d+\.)(\w[\w\-]{0,24}\w)\.

  • (?!\d+\.) is a negative lookahead. It checks that the domain is not composed only of digits.

  • \w the first character cannot be a hyphen, so a letter or a digit is expected (note that \w also matches the underscore).

  • [\w\-]{0,24} once the first character is known not to be a hyphen, there may be between 0 and 24 letters, digits, or hyphens.

  • \w the last character cannot be a hyphen either.

  • \. the dot that begins the rest of the domain, which you are not interested in.

In (\w[\w\-]{0,24}\w) you expect 1 non-hyphen character + 0 to 24 letters, digits, or hyphens + 1 non-hyphen character. These three parts together ensure that your domain has at least 2 and at most 26 characters.
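As a quick way to try the pattern, the sketch below runs it in JavaScript against a few inputs. The helper `firstLabel` and the variable names are hypothetical. The second pattern, `strictRegex`, is not part of the answer: it is an assumed variant that swaps \w for an explicit class, because \w also matches the underscore, which rule 2 of the question does not allow.

```javascript
// The regex from the answer, unchanged.
const answerRegex = /^(?!\d+\.)(\w[\w\-]{0,24}\w)\./;

// Assumed stricter variant: same structure, but "_" is no longer accepted.
const strictRegex = /^(?!\d+\.)([a-z0-9][a-z0-9-]{0,24}[a-z0-9])\./i;

// Hypothetical helper: returns the captured first label, or null when there is no match.
function firstLabel(regex, value) {
  const match = regex.exec(value);
  return match ? match[1] : null;
}

console.log(firstLabel(answerRegex, "domain.com.br"));  // "domain"
console.log(firstLabel(answerRegex, "123.com.br"));     // null (digits only)
console.log(firstLabel(answerRegex, "abc-.com.br"));    // null (trailing hyphen)
console.log(firstLabel(answerRegex, "my_site.com.br")); // "my_site" (underscore accepted)
console.log(firstLabel(strictRegex, "my_site.com.br")); // null
```

A label such as 12a is still accepted by both patterns, since the lookahead only rejects labels made up entirely of digits.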

You can see the regex running here.

Remarks:

The regex above does not handle letters from other languages, as commented by @Bacoo, because ECMAScript 5 and earlier do not offer native support for Unicode regexes (ECMAScript 6 does).

The same regex could be implemented in PHP (which supports Unicode regexes) as follows:

^(?!\d+\.)([\p{L}0-9][\p{L}0-9\-]{0,24}[\p{L}0-9])\.

where [\p{L}0-9] replaces \w to accept letters in any language plus digits. You can see this regex working here.
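The same idea can be sketched in JavaScript on engines that support Unicode property escapes (the `u` flag together with \p{L}, which arrived in ECMAScript 2018, later than ECMAScript 6). The names `unicodeRegex` and `firstUnicodeLabel` are hypothetical; the hyphen is placed last in the character class instead of being escaped, and the NFC normalization step is an added assumption so that a letter typed as base letter plus combining accent is still seen as a single letter.

```javascript
// Unicode-aware variant of the pattern: \p{L} matches a letter in any script.
const unicodeRegex = /^(?!\d+\.)([\p{L}0-9][\p{L}0-9-]{0,24}[\p{L}0-9])\./u;

// Hypothetical helper: returns the first label, or null when the value is rejected.
function firstUnicodeLabel(value) {
  // Assumption: normalize to NFC so "a" + combining tilde becomes the single letter "ã".
  const match = unicodeRegex.exec(value.normalize("NFC"));
  return match ? match[1] : null;
}

console.log(firstUnicodeLabel("estad\u00e3o.com.br")); // "estadão"
console.log(firstUnicodeLabel("123.com.br"));          // null (digits only)
```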
