Regular expression for e-mail validation with REGEXP_LIKE on Oracle

I am trying to add an extra validation to my regular expression in Oracle 11g, using REGEXP_LIKE.

I want the expression to reject two consecutive underscores, but to accept underscores that are separated by other characters. I only want this validation in the first part of the domain, that is, after the at sign (@) and before the first dot.

Ex:

  • blablabla@sapo_gmail.com will be a valid email;
  • blablabla@stack_sapo_gmail.com will be a valid email;
  • blablabla@sapo__gmail.com will be a rejected email.

The expression I have right now is as follows:

'^[a-zA-Z0-9_+-]+[a-zA-Z0-9._+-]*[a-zA-Z0-9_+-]+@[a-zA-Z0-9_+-]+[a-zA-Z0-9._+-]*[.]{1,1}[a-zA-Z]{2,}$'

Query I use to validate:

WITH T1 AS (
  SELECT 'blablabla@sapo_gmail.com' EMAIL FROM DUAL
  UNION
  SELECT 'blablabla@stack_sapo_gmail.com' EMAIL FROM DUAL
  UNION
  SELECT 'blablabla@sapo__gmail.com' EMAIL FROM DUAL
  UNION
  SELECT ' ' EMAIL FROM DUAL
)
SELECT EMAIL,ROWNUM
FROM T1
WHERE 1=1
AND NOT (REGEXP_LIKE (EMAIL,'^[a-zA-Z0-9_+-]+[a-zA-Z0-9._+-]*[a-zA-Z0-9_+-]+@[a-zA-Z0-9_+-]+[a-zA-Z0-9._+-]*[.]{1,1}[a-zA-Z]{2,}$')
AND LENGTH (EMAIL) > 0)

1 answer

First of all, you can simplify your expression a little. Inside the brackets, a-zA-Z0-9 can be replaced with the POSIX class [:alnum:], so [a-zA-Z0-9_+-] becomes [[:alnum:]_+-].

And [.]{1,1} means "at least one and at most one dot", so the quantifier {1,1} is redundant and can be omitted. Brackets (a "character class") are useful when more than one character is possible at that position (as in your [a-zA-Z], for example). When only one character is possible, the brackets are not necessary.

The dot can then be written as \. (outside brackets, a dot on its own has a special meaning, "any character (except line breaks)", so to match only a literal dot you must escape it with \).
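To see the difference between an unescaped dot, an escaped dot and a dot inside brackets, a small sketch like the one below can help. The sample strings 'a.b' and 'axb' and the column aliases are assumptions made up for illustration; only REGEXP_LIKE and DUAL come from the question.

```sql
-- Hypothetical sample strings, used only to compare the three ways of writing the dot.
WITH T AS (
  SELECT 'a.b' S FROM DUAL
  UNION ALL
  SELECT 'axb' S FROM DUAL
)
SELECT S,
       -- unescaped dot: matches any character, so both rows give Y
       CASE WHEN REGEXP_LIKE(S, '^a.b$')   THEN 'Y' ELSE 'N' END AS UNESCAPED_DOT,
       -- escaped dot: only 'a.b' gives Y
       CASE WHEN REGEXP_LIKE(S, '^a\.b$')  THEN 'Y' ELSE 'N' END AS ESCAPED_DOT,
       -- dot inside brackets: literal as well, so only 'a.b' gives Y
       CASE WHEN REGEXP_LIKE(S, '^a[.]b$') THEN 'Y' ELSE 'N' END AS BRACKETED_DOT
FROM T
```

The CASE wrapper is there because REGEXP_LIKE is a condition and, in Oracle 11g, cannot be selected directly as a column.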

Finally, testing LENGTH is not required either. The regex already starts with [a-zA-Z0-9_+-]+ (the + at the end means "one or more occurrences", which already guarantees at least one character). This is repeated more than once, and the @ and the {2,} (two or more occurrences) guarantee a few more characters. If the field has fewer characters than required, the regex fails, so checking its length separately is unnecessary.
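As a rough check of that reasoning: counting only the mandatory pieces of your expression gives a minimum of seven characters (two before the @, the @ itself, one after it, the dot, and two letters). The sketch below runs your regex against a few sample values; 'ab@c.de' and 'a@c.de' are hypothetical addresses chosen only to sit at and just below that minimum.

```sql
-- Hypothetical sample values; the pattern is the one from the question.
WITH T AS (
  SELECT ' ' EMAIL FROM DUAL        -- length 1, fails the regex
  UNION ALL
  SELECT 'a@c.de' EMAIL FROM DUAL   -- length 6, only one character before the @
  UNION ALL
  SELECT 'ab@c.de' EMAIL FROM DUAL  -- length 7, the shortest shape the regex accepts
)
SELECT EMAIL,
       LENGTH(EMAIL) AS LEN,
       CASE WHEN REGEXP_LIKE(EMAIL, '^[a-zA-Z0-9_+-]+[a-zA-Z0-9._+-]*[a-zA-Z0-9_+-]+@[a-zA-Z0-9_+-]+[a-zA-Z0-9._+-]*[.]{1,1}[a-zA-Z]{2,}$')
            THEN 'Y' ELSE 'N' END AS MATCHES
FROM T
```

Only the last row should come back with MATCHES = 'Y', even though all three rows have LENGTH > 0, which is why the length test adds nothing here.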


Another point is about using regex to validate emails. The subject is very broad and there are many possibilities. In practice, you must find a balance between the complexity of the regex and the correctness of the results. If a simpler regex already works for your data, there is no problem. But if a regex is so simple that it ends up accepting invalid emails, it does not do much good either.

Your regex, for example, starts with [a-zA-Z0-9_+-]+, which means it accepts emails such as [email protected], and more generally a part before the @ made up only of _, + and - characters. It is up to you to decide whether this is acceptable (depending on the data being queried, it may make no difference; each case is different).

Anyway, on the subject of using regex to validate emails, there is more material here, here, here and here (the last one has some regex options at the end, though I do not recommend the last of those). This article also has some options, and shows how the regex starts out fairly simple and gets more and more complicated.


Regardless of the regex you choose, I suggest querying with two expressions: one to check the email and one to check that there are no two consecutive underscores. Example:

WITH T1 AS (
  SELECT 'blablabla@sapo_gmail.com' EMAIL FROM DUAL
  UNION
  SELECT 'blablabla@stack_sapo_gmail.com.br' EMAIL FROM DUAL
  UNION
  SELECT 'blablabla@sapo__gmail.com' EMAIL FROM DUAL
  UNION
  SELECT '[email protected]__ponto.com' EMAIL FROM DUAL
  UNION
  SELECT ' ' EMAIL FROM DUAL
)
SELECT EMAIL,ROWNUM
FROM T1 
WHERE
REGEXP_LIKE (EMAIL,'^[[:alnum:]_+-]+[[:alnum:]._+-]*[[:alnum:]_+-]+@[[:alnum:]_+-]+(\.[[:alnum:]_+-]+)*\.[a-zA-Z]{2,}$')
AND
NOT REGEXP_LIKE(EMAIL, '^[^@]+@[^_.]*__[^.]*\..*$')

With this, the first regex checks whether the email is valid (you can keep using yours or switch to any of those from the links mentioned above). The second regex checks whether there are two consecutive underscores after the @ and before the first dot.

I used a negated character class, delimited by [^ and ]. It works as the opposite of []: while [a-z] is "a letter from a to z", [^a-z] is "any character that is not a letter from a to z". In this case, the regex means:

  • ^: start of the string
  • [^@]+: one or more characters that are not @
  • @: the at sign itself
  • [^_.]*: zero or more characters that are neither _ nor . (ensuring that I only get the __ before the first dot)
  • __: two consecutive _ characters (which can also be written as _{2}, if you find that more readable)
  • [^.]*: zero or more characters that are not .
  • \..*: the dot character itself (\.) followed by "anything" (.*)
  • $: end of the string

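That is, the regex checks whether there are two consecutive underscores after the @ and before the first dot. Some of the negated character classes may be redundant, because the first regex has already checked the format, but I prefer to make it very explicit what I am checking. One caveat: because [^_.]* cannot cross a single underscore, the pattern misses a double underscore that comes after a single one in the same part (for example, stack_sapo__gmail.com). Using [^.]* in that position instead covers that case.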
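To see how the second pattern behaves on its own, including the case of a single underscore that comes before a double one, a sketch like the following can be run. The address 'blablabla@stack_sapo__gmail.com' and the variant pattern that uses [^.]* instead of [^_.]* are assumptions added for illustration, not part of the queries above.

```sql
-- The third address and the VARIANT_PATTERN column are assumptions added for illustration.
WITH T AS (
  SELECT 'blablabla@sapo_gmail.com' EMAIL FROM DUAL         -- no double underscore: N, N
  UNION ALL
  SELECT 'blablabla@sapo__gmail.com' EMAIL FROM DUAL        -- double underscore: Y, Y
  UNION ALL
  SELECT 'blablabla@stack_sapo__gmail.com' EMAIL FROM DUAL  -- single, then double: N, Y
)
SELECT EMAIL,
       -- pattern used in the query above: [^_.]* stops at the first underscore
       CASE WHEN REGEXP_LIKE(EMAIL, '^[^@]+@[^_.]*__[^.]*\..*$') THEN 'Y' ELSE 'N' END AS ORIGINAL_PATTERN,
       -- variant: [^.]* may run over single underscores before reaching the __
       CASE WHEN REGEXP_LIKE(EMAIL, '^[^@]+@[^.]*__[^.]*\..*$')  THEN 'Y' ELSE 'N' END AS VARIANT_PATTERN
FROM T
```

A 'Y' here means "contains a double underscore before the first dot", so in the validation query the pattern stays wrapped in NOT REGEXP_LIKE.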

In this case, the returned emails will be blablabla@sapo_gmail.com, blablabla@stack_sapo_gmail.com.br and [email protected]__ponto.com. See this example on SQL Fiddle.


Instead of [:alnum:], it is also possible to use the shortcut \w, which is equivalent to [a-zA-Z0-9_]. Notice the difference between \w and [:alnum:] is that \w also considers the character _.

The only catch is that Oracle does not accept \w (or any of these shortcuts) inside brackets. In other languages and engines it is usually possible to write [\w+-], for example, which would be equivalent to [a-zA-Z0-9_+-], but in Oracle \w does not work inside brackets. You can do something similar using alternation: (\w|[+-]). The character | means "or", so this regex is "a \w or a [+-]" (which in turn means "a + or a -").
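A quick way to check this behavior on your own instance is to compare the two forms side by side. This is only a sketch: the sample strings 'a+b' and 'w+w' and the column aliases are made up for illustration.

```sql
-- Hypothetical sample strings; the two patterns are the forms discussed above.
WITH T AS (
  SELECT 'a+b' S FROM DUAL
  UNION ALL
  SELECT 'w+w' S FROM DUAL
)
SELECT S,
       -- \w written inside brackets
       CASE WHEN REGEXP_LIKE(S, '^[\w+-]+$')    THEN 'Y' ELSE 'N' END AS W_IN_BRACKETS,
       -- \w combined with [+-] through alternation
       CASE WHEN REGEXP_LIKE(S, '^(\w|[+-])+$') THEN 'Y' ELSE 'N' END AS W_ALTERNATION
FROM T
```

If \w is not treated as a shortcut inside brackets, W_IN_BRACKETS should be N for 'a+b', while W_ALTERNATION should be Y for both rows.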

Therefore, the query could also be like this:

WITH T1 AS (
  SELECT 'blablabla@sapo_gmail.com' EMAIL FROM DUAL
  UNION
  SELECT 'blablabla@stack_sapo_gmail.com.br' EMAIL FROM DUAL
  UNION
  SELECT 'blablabla@sapo__gmail.com' EMAIL FROM DUAL
  UNION
  SELECT '[email protected]__ponto.com' EMAIL FROM DUAL
  UNION
  SELECT ' ' EMAIL FROM DUAL
)
SELECT EMAIL,ROWNUM
FROM T1 
WHERE
REGEXP_LIKE (EMAIL,'^(\w|[+-])+(\w|[.+-])*(\w|[+-])+@(\w|[+-])+(\.(\w|[+-])+)*\.[a-zA-Z]{2,}$')
AND
NOT REGEXP_LIKE(EMAIL, '^[^@]+@[^_.]*__[^.]*\..*$')

The result is the same as the previous one (see SQL Fiddle).
