MYSQL and REGEX Phone Number

Question

MYSQL and REGEX Phone Number

Asked 6 years, 6 months ago

Viewed 1,491 times

6

I need to select records that have in the middle of the phone numbers text to delete (automatic moderation).

It’s a forum, people every day comment and put phone numbers in the middle of the committal (Whats).

Example of formats: 993345987, 9 93345987, (21)932324343, (21)9 3232 4343, etc....

If I could pick up two patterns (922222222, 9 2222 2222) I think it would already work.

I’m doing it like this, very simple:

select * from tabela where comentario REGEXP '[0-9]{9}';

Is returning phone numbers, but other numbers, for example 3333333333333333333333, also return records.

How to limit to bring only numbers that have 9 characters only, so: 966666666.

One way to do this is this(not done VERY well but resolve): ( ([0-9]{2,3} )[0-9 ]{8,11}|[0-9]{8,11}|[0-9 ]{8,14})

– Guilherme Barros

2019/01/22 at 18:08

3 answers

3

Run the query:

select * from tabela where comentario REGEXP '.*[0-9]{2}9[0-9]{8}.*|.*[0-9]{2} 9 [0-9]{4} [0-9]{4}.*';

See practical example on http://sqlfiddle.com/#! 9/225b6cf/2

1

If you have 3393333333333333 (which is clearly not a telephone), your regex accepts: http://sqlfiddle.com/#! 9/bfbe36/1

– hkotsubo

2019/01/22 at 21:08

Browser other questions tagged mysql regex

You are not signed in. Login or sign up in order to post.

by hkotsubo • **55,826** points · Answer 1 · 2019-01-22T23:03:26+00:00

If it is the users themselves who put the phones in the comments, then there is not much control over the format.

Of course you can consider some more common formats. By their examples, I saw that can be the 9 digits of the mobile number, all together or separated by spaces (999999999, 9 9999 9999 or 99999 9999), the DDD being optional.

To Reply from @Lipespry suggests a very complex regex to contemplate these (and many other) cases, but the regex syntax supported by Mysql unfortunately is somewhat limited and will not support all proposed resources such as \d to represent digits or the lookaheads and lookbehinds (the passages that begin with (?=, (?! and (?<!).

Therefore, it follows an alternative that checks some formats:

select * from tabela where comentario REGEXP
'(^|[^0-9])(\\(?0?[0-9]{2}\\)?)?9 ?[0-9]{4} ?[0-9]{4}([^0-9]|$)';

(^|[^0-9]): the | means "or". Therefore, this passage means "string start" (^) or "anything other than a number" (the [^ means that I don’t want what comes after, I mean, I don’t want 0-9 - no digit from zero to 9).

This ensures that so far I can be at the beginning of the string (going that the phone is already at the beginning), or has any character that is not a number (avoiding that take cases like 3393333333333333333).

Next we have (\$?0?[0-9]{2}\$?)?. We’ll go in pieces, inside out:

0?[0-9]{2} - an optional zero (0? - the ? indicates "zero or an occurrence", which is the same as saying "optional"), followed by 2 digits ([0-9] is any digit from 0 to 9, and {2} what to say "two occurrences"), because the DDD can be written as 11 or 011
\$? and \$? - parentheses can be optional. I did so because only parentheses ( and ) have special significance in regex as they serve to group sub-expressions. So we have to escape them with \\.
finally, this entire passage is in parentheses (i.e., grouped in a single sub-expression), and the ? at the end makes all this stretch be optional.

That is, the DDD is optional.

Then we have 9 ?, which is number 9 followed by an optional space (note that there is a space before the ?, that is, the space that is optional, not the 9). Here I am assuming that it will only be mobile numbers that start with 9 - remembering that in the future we can have cell phones that start with 8, 7, etc, so it is up to you to always leave 9 or switch to [0-9] (or [7-9] if you want to start with only 7, 8 or 9, etc).

Then we have [0-9]{4} (4 digits), followed by an optional space plus 4 digits.

And finally, we have ([^0-9]|$): any character other than a number or the end of the string ($). This also ensures that you will not pick more digits than necessary, avoiding that take for example 3393333333333333333.

In this Sqlfiddle you can see this query working.

If you want to increment, you can put the separator as a hyphen or space, for example, so numbers like 9 9123-4567 or 99123-4567. Just exchange the optional spaces for [ \\-]? (an optional space or hyphen). regex would look like this:

select * from tabela where comentario REGEXP
'(^|[^0-9])(\\(?0?[0-9]{2}\\)?)?9[ \\-]?[0-9]{4}[ \\-]?[0-9]{4}([^0-9]|$)';

See here her working.

It is also possible to add an optional space after the DDD:

(^|[^0-9])(\\(?0?[0-9]{2}\\)?)? ?9 ?[0-9]{4} ?[0-9]{4}([^0-9]|$)
                               ^^

Without this space, the DDD is ignored for cases like (11) 9 9123 4567 - only the phone number is captured by regex, but the DDD is not, see here an example. Already putting the optional space, the DDD is also captured, see here the difference.

Another detail is that we are only considering cell phone numbers. But there are still 8-digit phone numbers (in homes not so much, but in businesses, it’s still quite common). If you also want to consider these numbers, just put 9 as optional:

(^|[^0-9])(\\(?0?[0-9]{2}\\)?)? ?9? ?[0-9]{4} ?[0-9]{4}([^0-9]|$)
                                 ^^

Just remembering that, as are the users who type their numbers, there may always be some strange format that you did not foresee. And the more possibilities, the more complex the regex becomes.

For example, the regex I suggested only considers an optional whitespace. But if you want more than one space, you can change the ? for * (zero or more occurrences), or limit the amount with keys (for example, {0,3} limits between 0 to 3 occurrences).

There is also the possibility of a CPF being mistaken for a phone, since both can be written without any separator (43912341222 can be either a CPF or a DDD + phone - even if people usually write the CPF as 439.123.412-22, Who says you won’t have a case like this? Anyway, see if this applies to your cases). Anyway, regex is not such a "magical" business, and it is up to you to assess whether you will trust her so much that you automatically remove anything she picks up...

Perhaps it is best to follow the suggestion given in Reply from @Lipespry and check this out before to insert into the database. I don’t know what language you are using, but most of them have more modern regex Engines, which allows you to write expressions like this, for example (which uses \b to delimit the phones, without using the "trick" I did above with (^|[^0-9]) and ([^0-9]|$), besides using \d as a shortcut to [0-9] and \s into spaces).

by LipESprY • **4,525** points · Answer 2 · 2019-01-22T20:09:50+00:00

"- [...] who have in the middle of the text phone numbers to delete (automatic moderation) [...] It is a forum, people everyday comment and put phone numbers in the middle of the committed"

First, I’ll start with a tip. Automatic moderation suggests that you go automatically delete when "find any occurrence of regular expression".

With a simple regular expression you may end up deleting what should not be deleted!

Following this line of reasoning, you should create an analysis system: when the query with regular expression finds some "suspicion", a human (moderator) will evaluate whether it should be excluded or not. Implement as you see fit for your use, of course!

The first requirement to use regular expression is to understand that they seek by following a (or more) pattern!

When it comes to user entries from your system, you should provide as many patterns as possible to formulate your regular expression.

And not only fetch dois números followed by um 9 followed by oito números or this same reasoning allowing space between numbers.

Users can write a phone number without even following a pattern... What we can do is look for as many patterns as possible and, over time, evolve regular expression. As you see occurrences that were not found by your query...

Following this my theory, I formulated an expression that aims to find some of the most widely used standards:

(?=\d{2,3}9\d+)(?<!\d)(\d{11,12})(?!\d)|(?<!\d)(\d{2}\s\d{5}\s\d{4})(?!\d)|(?<!\d)(0\d{2}\s\d{5}\s\d{4})(?!\d)|(?<!\d)(\d{2}\s9\s\d{4}\s\d{4})(?!\d)|(?<!\d)(0\d{2}\s9\s\d{4}\s\d{4})(?!\d)|(?<!\d)(\d{9})(?!\d)|(?<!\d)(\d\s\d{8})(?!\d)|(?<!\d)(\(\d{2}\)\d{9})(?!\d)|(?<!\d)(\(0\d{2}\)\d{9})(?!\d)|(?<!\d)(\(\d{2}\)\d{1}\s\d{4}\s\d{4})(?!\d)|(?<!\d)(\(0\d{2}\)\d{1}\s\d{4}\s\d{4})(?!\d)|(?<!\d)(\(\d{2}\)\s\d{1}\s\d{4}\s\d{4})(?!\d)|(?<!\d)(\(0\d{2}\)\s\d{1}\s\d{4}\s\d{4})(?!\d)

Immense, right!? Well! Still she can’t find half the countless possibilities!

But I took it as main objective the examples cited in the question and some others that I was thinking as I was developing the expression.

See it working in Regex101.com/r/Zafehd/1/:

Note that in the test itself, I left exactly the expression that meets that occurrence. The expression as a whole is a union of all these little expressions: expressão|outra_expressão|outra_expressão|....

Now just add it to your query and, depending on the amount of records, wait some time to return the occurrences...

Another tip is you filter before inserting in the database. Direct in your application.

It gets a lot more "light" for your server to run this regular expression in a single text (comment) than going through all the records in your database table...

Ex.: I tried to simulate a query on Sqlfiddle in order to supplement this reply, but gave timeout...