If the users themselves are the ones typing phone numbers into the comments, you have little control over the format.
You can, of course, cover the most common formats. From your examples, the number can be the 9 digits of a mobile number, written together or separated by spaces (`999999999`, `9 9999 9999` or `99999 9999`), with the DDD (area code) being optional.
The answer from @Lipespry suggests a very complex regex to cover these (and many other) cases. Unfortunately, the regex syntax supported by MySQL is somewhat limited (at least before version 8.0, which switched to the ICU regex library) and does not support all the features used there, such as `\d` to represent digits or lookaheads and lookbehinds (the parts that begin with `(?=`, `(?!` and `(?<!`).
So here is an alternative that checks a few formats:
select * from tabela where comentario REGEXP
'(^|[^0-9])(\\(?0?[0-9]{2}\\)?)?9 ?[0-9]{4} ?[0-9]{4}([^0-9]|$)';
`(^|[^0-9])`: the `|` means "or". So this part means "start of the string" (`^`) or "any character that is not a digit" (the `[^` negates what follows, in this case `0-9`, so no digit from 0 to 9 is accepted).
This guarantees that, at this point, I am either at the beginning of the string (in case the phone number is right at the start) or just after a character that is not a digit (which avoids matching cases like `3393333333333333333`).
Next we have `(\\(?0?[0-9]{2}\\)?)?`. Let's take it in pieces, from the inside out:
- `0?[0-9]{2}`: an optional zero (`0?`; the `?` means "zero or one occurrence", which is the same as saying "optional"), followed by 2 digits (`[0-9]` is any digit from 0 to 9, and `{2}` means "two occurrences"), because the DDD can be written as `11` or `011`
- `\\(?` and `\\)?`: optional parentheses. They are written this way because, unescaped, the parentheses `(` and `)` have a special meaning in regex: they group sub-expressions. So we have to escape them with `\\` (the backslash is doubled because it must itself be escaped inside a MySQL string literal).
- finally, this whole part is wrapped in parentheses (that is, grouped into a single sub-expression), and the `?` at the end makes the entire group optional. In other words, the DDD is optional.
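To see which DDD spellings this piece accepts, here is a minimal sketch in Python (an assumption on my part, since the question does not say which language is in use). The pattern is the same one, only with single backslashes, because a Python raw string does not need the doubling that a MySQL string literal does. The names `DDD` and `is_ddd` are made up for the illustration.

```python
import re

# Python translation of the optional-DDD piece. MySQL's doubled backslash (\\)
# becomes a single one inside a Python raw string.
DDD = re.compile(r"\(?0?[0-9]{2}\)?")


def is_ddd(text: str) -> bool:
    """Return True when the whole text is accepted by the DDD sub-expression."""
    return DDD.fullmatch(text) is not None


for sample in ["11", "011", "(11)", "(011)", "(11", "1", "0111"]:
    print(f"{sample!r}: {is_ddd(sample)}")
```

Note that `(11` is accepted too: each parenthesis is optional on its own, so the regex does not require them to be balanced.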
Then we have `9 ?`, which is the digit 9 followed by an optional space (note the space before the `?`: it is the space that is optional, not the `9`). Here I am assuming that only mobile numbers starting with 9 matter. Keep in mind that in the future there may be mobile numbers starting with 8, 7, etc., so it is up to you whether to keep the `9` or switch to `[0-9]` (or `[7-9]` if you only want numbers starting with 7, 8 or 9, and so on).
Then we have `[0-9]{4}` (4 digits), followed by an optional space and 4 more digits.
And finally we have `([^0-9]|$)`: any character that is not a digit, or the end of the string (`$`). This also guarantees that the regex does not match more digits than it should, again avoiding cases like `3393333333333333333`.
You can see this query working in this SQL Fiddle.
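If you prefer to try the expression outside the database, the sketch below runs the same pattern through Python's `re` module. This is only an approximation of what MySQL does: I am assuming that the constructs used here (`^`, `$`, `|`, `?`, `{n}` and bracket expressions) behave the same in both engines. `PHONE` and `has_phone` are hypothetical names.

```python
import re

# Same expression as in the query, with single backslashes (Python raw string).
PHONE = re.compile(
    r"(^|[^0-9])(\(?0?[0-9]{2}\)?)?9 ?[0-9]{4} ?[0-9]{4}([^0-9]|$)"
)


def has_phone(comment: str) -> bool:
    """Rough equivalent of `comentario REGEXP '...'`: True if any part matches."""
    return PHONE.search(comment) is not None


samples = [
    "999999999",
    "my number is 9 9999 9999, call me",
    "(11)99999 9999",
    "3393333333333333333",  # too many digits, must be rejected
    "order 12345",           # too few digits, must be rejected
]
for sample in samples:
    print(f"{sample!r}: {has_phone(sample)}")
```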
If you want to go further, you can accept a hyphen or a space as the separator, to cover numbers like `9 9123-4567` or `99123-4567`. Just replace each optional space with `[ \\-]?` (an optional space or hyphen). The regex would look like this:
select * from tabela where comentario REGEXP
'(^|[^0-9])(\\(?0?[0-9]{2}\\)?)?9[ \\-]?[0-9]{4}[ \\-]?[0-9]{4}([^0-9]|$)';
See it working here.
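As a quick comparison, this Python sketch (again assuming Python's `re` is a fair stand-in for MySQL on these constructs) runs the space-only version and the space-or-hyphen version side by side. `SPACES_ONLY` and `WITH_HYPHEN` are names I made up.

```python
import re

# Original version: the separator can only be a single optional space.
SPACES_ONLY = re.compile(
    r"(^|[^0-9])(\(?0?[0-9]{2}\)?)?9 ?[0-9]{4} ?[0-9]{4}([^0-9]|$)"
)
# Extended version: the separator can be an optional space or hyphen.
WITH_HYPHEN = re.compile(
    r"(^|[^0-9])(\(?0?[0-9]{2}\)?)?9[ \-]?[0-9]{4}[ \-]?[0-9]{4}([^0-9]|$)"
)

for sample in ["9 9123-4567", "99123-4567", "9-9123-4567", "999999999"]:
    spaces = SPACES_ONLY.search(sample) is not None
    hyphen = WITH_HYPHEN.search(sample) is not None
    print(f"{sample!r}: spaces only={spaces}, with hyphen={hyphen}")
```

Numbers without any hyphen are still accepted by both versions, so the change only widens what is matched.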
It is also possible to add an optional space after the DDD:
(^|[^0-9])(\\(?0?[0-9]{2}\\)?)? ?9 ?[0-9]{4} ?[0-9]{4}([^0-9]|$)
                               ^^
Without this space, the DDD is left out in cases like `(11) 9 9123 4567`: only the phone number is captured by the regex, not the DDD (see an example here). With the optional space, the DDD is captured as well (see the difference here).
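MySQL's `REGEXP` only answers "matched or not", so the difference in what gets captured is easier to see in a language that exposes the groups. A small Python sketch, under the assumption that Python's `re` behaves like MySQL for these constructs; `captured_ddd` is a hypothetical helper that returns the second group (the DDD):

```python
import re

WITHOUT_SPACE = re.compile(
    r"(^|[^0-9])(\(?0?[0-9]{2}\)?)?9 ?[0-9]{4} ?[0-9]{4}([^0-9]|$)"
)
WITH_SPACE = re.compile(
    r"(^|[^0-9])(\(?0?[0-9]{2}\)?)? ?9 ?[0-9]{4} ?[0-9]{4}([^0-9]|$)"
)


def captured_ddd(pattern: re.Pattern, text: str):
    """Return the text captured by the DDD group, or None if it took no part."""
    match = pattern.search(text)
    return match.group(2) if match else None


text = "(11) 9 9123 4567"
print(captured_ddd(WITHOUT_SPACE, text))  # the phone matches, the DDD is skipped
print(captured_ddd(WITH_SPACE, text))     # the DDD is part of the match
```

Both versions find a phone number in the text, so for a plain `WHERE ... REGEXP` filter the result is the same; the space matters when you also want to extract or remove the DDD.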
Another detail is that we are only considering mobile numbers. But 8-digit phone numbers still exist (not so much in homes, but they are still quite common in businesses). If you also want to consider these numbers, just make the 9 optional:
(^|[^0-9])(\\(?0?[0-9]{2}\\)?)? ?9? ?[0-9]{4} ?[0-9]{4}([^0-9]|$)
                                 ^^
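A short Python sketch of the effect of making the 9 optional (same assumption as before: Python's `re` standing in for MySQL; `MOBILE_ONLY` and `LANDLINE_TOO` are illustrative names):

```python
import re

# 9 is mandatory: only 9-digit mobile numbers.
MOBILE_ONLY = re.compile(
    r"(^|[^0-9])(\(?0?[0-9]{2}\)?)? ?9 ?[0-9]{4} ?[0-9]{4}([^0-9]|$)"
)
# 9 is optional: 8-digit numbers are accepted as well.
LANDLINE_TOO = re.compile(
    r"(^|[^0-9])(\(?0?[0-9]{2}\)?)? ?9? ?[0-9]{4} ?[0-9]{4}([^0-9]|$)"
)

for sample in ["3123 4567", "(11) 3123 4567", "99123 4567", "order 12345678"]:
    mobile = MOBILE_ONLY.search(sample) is not None
    both = LANDLINE_TOO.search(sample) is not None
    print(f"{sample!r}: mobile only={mobile}, landline too={both}")
```

The last sample shows the price of the change: any run of exactly 8 digits now looks like a phone number, so false positives become more likely.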
Just keep in mind that, since the users are the ones typing the numbers, there can always be some odd format you did not anticipate. And the more possibilities you cover, the more complex the regex becomes.
For example, the regex I suggested only allows a single optional space. If you want to allow more than one, you can replace the `?` with `*` (zero or more occurrences), or limit the count with braces (for example, `{0,3}` allows from 0 to 3 occurrences).
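The three quantifiers can be compared with a small Python sketch (assuming, as before, that Python's `re` agrees with MySQL on `?`, `*` and `{m,n}`; the pattern names are mine):

```python
import re

PREFIX = r"(^|[^0-9])(\(?0?[0-9]{2}\)?)?"
SUFFIX = r"([^0-9]|$)"

# Each pattern differs only in the quantifier applied to the separating space.
ONE_SPACE = re.compile(PREFIX + r"9 ?[0-9]{4} ?[0-9]{4}" + SUFFIX)
ANY_SPACES = re.compile(PREFIX + r"9 *[0-9]{4} *[0-9]{4}" + SUFFIX)
UP_TO_THREE = re.compile(PREFIX + r"9 {0,3}[0-9]{4} {0,3}[0-9]{4}" + SUFFIX)

two_and_three = "9" + " " * 2 + "9123" + " " * 3 + "4567"
five_spaces = "9" + " " * 5 + "9123 4567"

for sample in [two_and_three, five_spaces]:
    print(
        repr(sample),
        ONE_SPACE.search(sample) is not None,
        ANY_SPACES.search(sample) is not None,
        UP_TO_THREE.search(sample) is not None,
    )
```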
There is also the possibility of a CPF being mistaken for a phone number, since both can be written without any separator (`43912341222` can be either a CPF or a DDD + phone number). Even though people usually write a CPF as `439.123.412-22`, who says you won't get a case like this? See whether this applies to your data. In any case, regex is not "magic", and it is up to you to decide whether you trust it enough to automatically remove everything it matches.
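The CPF ambiguity is easy to reproduce. In this Python sketch (Python's `re` used as a stand-in for MySQL, with the hypothetical helper `has_phone`), the unformatted CPF is indistinguishable from a DDD followed by a phone number, while the formatted one is not matched:

```python
import re

PHONE = re.compile(
    r"(^|[^0-9])(\(?0?[0-9]{2}\)?)?9 ?[0-9]{4} ?[0-9]{4}([^0-9]|$)"
)


def has_phone(comment: str) -> bool:
    """True if some part of the comment looks like a phone number."""
    return PHONE.search(comment) is not None


# 11 digits with no separators: "43" is read as the DDD, "912341222" as the phone.
print(has_phone("43912341222"))
# The usual CPF formatting breaks the digit runs, so nothing matches.
print(has_phone("439.123.412-22"))
```

The regex alone cannot tell the two apart, which is one more reason to review matches before deleting anything automatically.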
Perhaps the best approach is to follow the suggestion in @Lipespry's answer and validate this before inserting into the database. I don't know which language you are using, but most of them have more modern regex engines, which let you write expressions like this one, for example (which uses `\b` to delimit the phone numbers, without the "trick" I used above with `(^|[^0-9])` and `([^0-9]|$)`, besides using `\d` as a shortcut for `[0-9]` and `\s` for whitespace).
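As an illustration of that idea, here is a sketch in Python. The expression below is my own guess at what such a pattern could look like, not the exact one referred to above, and `MODERN_PHONE` and `find_phones` are hypothetical names. A parenthesised DDD is handled in its own alternative because `\b` does not match between the start of the string and a `(`.

```python
import re

# Illustrative pattern only: \d for digits, \s for whitespace, \b as delimiter.
# Either a DDD in parentheses, or a word boundary followed by an optional bare
# DDD; then the 9-digit mobile number, closed by another word boundary.
MODERN_PHONE = re.compile(
    r"(?:\(0?\d{2}\)\s?|\b(?:0?\d{2}\s?)?)9\s?\d{4}\s?\d{4}\b"
)


def find_phones(text: str) -> list[str]:
    """Return every substring of text that looks like a phone number."""
    return [match.group(0) for match in MODERN_PHONE.finditer(text)]


print(find_phones("Call (11) 99123 4567 or 9 8765 4321"))
print(find_phones("11991234567"))
print(find_phones("3393333333333333333"))
```

One difference from the `(^|[^0-9])` trick: `\b` treats letters and underscores as word characters too, so a number glued to a letter (as in `abc99123 4567`) is not matched by this version.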
One way to do this (not very well done, but it works): `( ([0-9]{2,3} )[0-9 ]{8,11}|[0-9]{8,11}|[0-9 ]{8,14})`
– Guilherme Barros