Question about how a regex works - Java

Could someone explain to me what this Regex allows?

private static final String MAIL_PATTERN = "^[_A-Za-z0-9-\\+]+(\\.[_A-Za-z0-9-]+)*@[A-Za-z0-9-]+(\\.[A-Za-z0-9]+)*(\\.[A-Za-z]{2,})$";
  • Sites like Regexr and Regex101 explain every component of a regular expression and also let you test it online.

  • Maybe your question is too broad. I suggest taking a look at tutorials like this and this. Then you can [Edit] your question to make it more specific about the particular point you are in doubt about.

  • @hkotsubo my question is too broad? I just want to know what the regex above does.

  • Notice that I wrote "Maybe". I am still unsure, because at first I thought I would need to write a whole regex tutorial to answer properly (which would make the question too broad, not because it asks several things, but because the answer would be too long). But maybe I can write a shorter answer; I’ll look into it later (if nobody answers first, of course). Anyway, if you read the links that Leonardo and I posted, you may be able to understand the basics and narrow your doubts down to more specific points. That helps make the question "less broad".

  • Sure, thanks a lot for the help, @hkotsubo! Big hugs.

  • In the end, I thought it was worth posting an answer. Even though it turned out very long, it is focused on your regex and its use for validating emails (it is not a regex tutorial, not least because you are using only a few syntax features). What matters is not the size of the answer but its focus, so I guess I was wrong to consider the question "too broad". Have fun and welcome to the strange/wonderful world of regular expressions :-)

  • Just to add: in this other answer I put some (very complicated) expressions to validate emails. I didn’t test them with Java, only with JavaScript, but it shouldn’t be "difficult" to adapt them, I think :-)


1 answer



Basically, this regex checks whether a String corresponds to an email address. But it may also end up accepting "weird" things (more on that at the end).

Analyzing the regex in detail:

^[_A-Za-z0-9-\\+]+(\\.[_A-Za-z0-9-]+)*@[A-Za-z0-9-]+(\\.[A-Za-z0-9]+)*(\\.[A-Za-z]{2,})$

^ and $ are markers that mean, respectively, the beginning and the end of the string. They ensure that the string contains only what the regex describes (without them, a match can succeed when only part of the string corresponds to the regex).
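To illustrate the effect of the anchors, here is a minimal sketch. The class name AnchorDemo, the simplified pattern [a-z]+ and the sample inputs are my own assumptions, chosen only to keep the example short; they are not part of the original regex.

```java
import java.util.regex.Pattern;

// Hypothetical demo class: compares the same expression with and without anchors.
class AnchorDemo {
    static final Pattern ANCHORED = Pattern.compile("^[a-z]+$");
    static final Pattern UNANCHORED = Pattern.compile("[a-z]+");

    // find() succeeds if some part of the input matches the pattern.
    static boolean foundAnchored(String s) {
        return ANCHORED.matcher(s).find();
    }

    static boolean foundUnanchored(String s) {
        return UNANCHORED.matcher(s).find();
    }

    public static void main(String[] args) {
        System.out.println(foundAnchored("abc"));         // true
        System.out.println(foundAnchored("123abc456"));   // false: digits surround the letters
        System.out.println(foundUnanchored("123abc456")); // true: "abc" alone is enough
    }
}
```

Note that Matcher.matches() and String.matches() always require the whole input to match, so the anchors matter mostly when find() is used.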

The brackets ([]) define a character class: they indicate that you want any one of the characters inside them. For example, [abc] means "the letter a or the letter b or the letter c" (only one of them, any one will do).

Inside the brackets you can also use some shortcuts, like A-Z, which means "letters from A to Z" (capital letters). Similarly, a-z means "lowercase letters from a to z" and 0-9 means "digits from 0 to 9".

In other words, [_A-Za-z0-9-\\+] means "the character _, or a letter (upper or lower case), or a digit, or a hyphen (-), or the plus sign (+)". The two backslashes are there to escape the +, but inside brackets the + is already an ordinary character, so the escape is not needed; in my tests it made no difference, so I left it as it was. (Regex syntax uses a single \ for escaping, but since the pattern is written inside a Java String literal, it has to be written as \\.)

The detail is that "anything within brackets" is an expression that corresponds to only one character. If you want more occurrences, you must use quantifiers, and that is what was done by putting a + after the brackets:

[_A-Za-z0-9-\\+]+
                ^

Outside the brackets, the + means "one or more occurrences" of whatever comes immediately before it. In this case, it is one or more occurrences of "the character _, or a letter (upper or lower case), or a digit, or a hyphen (-), or the plus sign (+)". That is, strings like abc, A124_fadfd-a12 and even +_a are considered valid by this expression.
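The three strings above can be checked with a small sketch like the following. The class name CharClassDemo and the two extra inputs (the empty string and a.b) are assumptions added for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical demo class: tests only the first piece of the regex in isolation.
class CharClassDemo {
    // One or more of: _, letters, digits, hyphen, plus sign.
    static final Pattern FIRST_PIECE = Pattern.compile("^[_A-Za-z0-9-\\+]+$");

    static boolean accepts(String s) {
        return FIRST_PIECE.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(accepts("abc"));            // true
        System.out.println(accepts("A124_fadfd-a12")); // true
        System.out.println(accepts("+_a"));            // true
        System.out.println(accepts(""));               // false: + needs at least one character
        System.out.println(accepts("a.b"));            // false: the dot is not in the class
    }
}
```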


Next we have (\\.[_A-Za-z0-9-]+)*. Let’s start with what’s inside the parentheses:

  • \\.: corresponds to the dot character (.). It is escaped with a backslash because the dot has a special meaning in regex ("any character"). With the backslash, it "loses its powers" and becomes an ordinary character
  • [_A-Za-z0-9-]+: similar to the previous case, this is one or more occurrences of _, or letters, or digits, or hyphens

Together, these 2 parts above are "a dot, followed by one or more letters/numbers/hyphens/underscore".

But all of this is inside parentheses, followed by a *. That means "zero or more occurrences of what is inside the parentheses". That is, the sequence "a dot, followed by one or more letters/numbers/hyphens/underscore" can occur several times (or not at all). We can have strings like .abc.cde123.fgh, or nothing at all.

Joining this with the previous expression, we have the first part of the email (before the @), which can range from a common username (such as joaosilva) to more complicated things like 32teste.abd2-cdef12_4232.xyz.afd.
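As a rough sketch, the first part of the email can be tested on its own like this. The class name LocalPartDemo and the three rejected inputs are assumptions I added to show what the group with * does not allow.

```java
import java.util.regex.Pattern;

// Hypothetical demo class: only the part of the regex that comes before the @.
class LocalPartDemo {
    static final Pattern LOCAL_PART =
            Pattern.compile("^[_A-Za-z0-9-\\+]+(\\.[_A-Za-z0-9-]+)*$");

    static boolean accepts(String s) {
        return LOCAL_PART.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(accepts("joaosilva"));                        // true
        System.out.println(accepts("32teste.abd2-cdef12_4232.xyz.afd")); // true
        System.out.println(accepts("joao..silva")); // false: a dot must be followed by at least one character of the class
        System.out.println(accepts("joao."));       // false: cannot end with a dot
        System.out.println(accepts(".joao"));       // false: cannot start with a dot
    }
}
```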

Then we have an @, which corresponds to the "at sign" character itself.


Now we have the second part of the email, after the @.

First we have [A-Za-z0-9-]+ (one or more letters/numbers/hyphens).

Then we have (\\.[A-Za-z0-9]+)*, whose logic we have already seen. In this case, it means "zero or more repetitions of (a dot followed by letters/numbers)". This covers domains such as algum.nome.comprido.com.br: the pieces .nome, .comprido and .com correspond to this expression.

Finally, we have (\\.[A-Za-z]{2,}). The quantifier {2,} means "two or more occurrences". Therefore, this expression means "a dot, followed by 2 or more letters". With this, the email domain cannot end with .a or .b, for example; it needs to have at least two letters.

Joining the 3 expressions above, we have the domain of the email address, which can range from gmail.com to addresses with several subdomains, such as a333-bcd.abc.co.uk.
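The domain part can be checked in isolation in the same way. The class name DomainDemo and the rejected inputs (example.a, localhost, example.c0m) are assumptions used only for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical demo class: only the part of the regex that comes after the @.
class DomainDemo {
    static final Pattern DOMAIN =
            Pattern.compile("^[A-Za-z0-9-]+(\\.[A-Za-z0-9]+)*(\\.[A-Za-z]{2,})$");

    static boolean accepts(String s) {
        return DOMAIN.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(accepts("gmail.com"));                  // true
        System.out.println(accepts("a333-bcd.abc.co.uk"));         // true
        System.out.println(accepts("algum.nome.comprido.com.br")); // true
        System.out.println(accepts("example.a"));   // false: last piece needs 2 or more letters
        System.out.println(accepts("localhost"));   // false: the final ".letters" piece is mandatory
        System.out.println(accepts("example.c0m")); // false: last piece accepts letters only
    }
}
```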


About the validation of emails

Unfortunately, validating emails is not as simple as it seems. There are too many rules (each part of the email has a size limit, the domain can also be an IP address, etc.), and the more accurate the regex is, the more complicated it becomes - see these examples, just to get an idea. And the simpler (or less complicated), the greater the chance of false positives.

The regex in question, for example, accepts emails such as [email protected] and [email protected] (see here working). This is because we use the hyphen inside the brackets and with quantifiers (one or more occurrences), so a String with several hyphens is considered valid.

Similarly, the first pair of brackets contains the character +, and these brackets have the quantifier + (one or more occurrences), so the regex considers that [email protected] is a valid email (see here). Technically, according to RFC 5322, I believe it is, but it is up to you to decide whether your system will accept such addresses.

Another problem is that it does not check the size limits of each part, nor does it accept IP addresses in place of the domain (just to name a few). The fact is that every rule you try to add makes the regex harder: see at this link how the regex gets more complicated every time we add a rule.

In the end, you should find a middle ground that suits your use cases. A very complicated regex can become a future maintenance problem (take one of the expressions from the links above; if it is already hard to understand, imagine having to modify it to add a new case or fix a bug).

Consider the pros and cons: simplifying the regex means giving up precision in exchange for ease of maintenance (but you get more false positives). Making it more precise can lead to that kind of complicated code that everyone is afraid to touch.

How bad is it to accept some invalid (or "strange") emails, such as [email protected] or [email protected], or [email protected], all considered valid by the regex in question? How bad is it not to accept some valid cases? And what are valid cases? user@localhost may be a valid case, depending on the situation, but this regex considers it invalid. Some solutions found on the Internet consider that the first part (before the @) can also have characters like %, $ and ! (which your regex does not accept).
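A quick way to see these trade-offs is to run the full regex against a few inputs. In the sketch below, the class name FullRegexDemo and all sample addresses except user@localhost are hypothetical values I made up to exercise the pattern; they are not necessarily the same addresses cited in the text.

```java
import java.util.regex.Pattern;

// Hypothetical demo class: the full MAIL_PATTERN from the question.
class FullRegexDemo {
    static final String MAIL_PATTERN =
            "^[_A-Za-z0-9-\\+]+(\\.[_A-Za-z0-9-]+)*@[A-Za-z0-9-]+(\\.[A-Za-z0-9]+)*(\\.[A-Za-z]{2,})$";
    static final Pattern MAIL = Pattern.compile(MAIL_PATTERN);

    static boolean accepts(String s) {
        return MAIL.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(accepts("joaosilva@gmail.com"));   // true
        System.out.println(accepts("---@---.com"));           // true: hyphens only are accepted
        System.out.println(accepts("+++@a.com"));             // true: plus signs only are accepted
        System.out.println(accepts("user@localhost"));        // false: domain has no ".letters" ending
        System.out.println(accepts("user!name@example.com")); // false: ! is not in the character class
    }
}
```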

In order not to accept "strange" cases (or at least minimize them), you can change the regex to accept emails that start only with letters, for example, putting this condition right after the ^:

^[A-Za-z][_A-Za-z0-9-\\+]+.... (the rest is the same)
 ^^^^^^^^

With this, the first character must be a letter, and the rest can be "one or more letters/numbers/hyphens/underscore/plus sign" etc. The regex will still accept [email protected], but you can add a rule that rejects two + in a row, and so on (notice the pattern "the more rules, the more complicated it gets"?).
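Here is a sketch of the modified regex in action. The class name FirstLetterDemo, the sample addresses and the variant using * are my own assumptions. One side effect worth noticing: since [A-Za-z] is followed by a class with the quantifier +, the part before the @ now needs at least two characters.

```java
import java.util.regex.Pattern;

// Hypothetical demo class: the regex with the "must start with a letter" rule.
class FirstLetterDemo {
    static final String REST =
            "(\\.[_A-Za-z0-9-]+)*@[A-Za-z0-9-]+(\\.[A-Za-z0-9]+)*(\\.[A-Za-z]{2,})$";

    // As written in the answer: a letter, then one or more characters of the class.
    static final Pattern WITH_PLUS =
            Pattern.compile("^[A-Za-z][_A-Za-z0-9-\\+]+" + REST);

    // My own variation (an assumption, not from the answer): * instead of +,
    // so that a single-letter username is also accepted.
    static final Pattern WITH_STAR =
            Pattern.compile("^[A-Za-z][_A-Za-z0-9-\\+]*" + REST);

    static boolean acceptsWithPlus(String s) {
        return WITH_PLUS.matcher(s).matches();
    }

    static boolean acceptsWithStar(String s) {
        return WITH_STAR.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(acceptsWithPlus("+++@a.com"));           // false: must start with a letter
        System.out.println(acceptsWithPlus("a++@a.com"));           // true: still accepted
        System.out.println(acceptsWithPlus("joaosilva@gmail.com")); // true
        System.out.println(acceptsWithPlus("a@b.com"));             // false: only one character before the @
        System.out.println(acceptsWithStar("a@b.com"));             // true
    }
}
```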

It is up to you to test, analyze your use cases and decide which path to take. Regex is an extremely powerful tool (and in my opinion, a very cool one), but it is not always the best solution for everything.

Perhaps a simpler solution is to break the check into several expressions (split the string at the @ and check each part with a different regex), which can be easier (to write, understand and maintain) than a monstrous do-it-all super-regex. Another solution is to use an external library.
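The "split at the @" idea could look roughly like the sketch below. The class and method names (SplitValidator, looksLikeEmail) are hypothetical, the two patterns are simply the two halves of the original regex, and the separation uses indexOf rather than String.split so that inputs with zero or several @ are easy to reject.

```java
import java.util.regex.Pattern;

// Hypothetical sketch: validates each side of the @ with its own, smaller regex.
class SplitValidator {
    static final Pattern LOCAL_PART =
            Pattern.compile("^[_A-Za-z0-9-\\+]+(\\.[_A-Za-z0-9-]+)*$");
    static final Pattern DOMAIN =
            Pattern.compile("^[A-Za-z0-9-]+(\\.[A-Za-z0-9]+)*(\\.[A-Za-z]{2,})$");

    static boolean looksLikeEmail(String s) {
        if (s == null) {
            return false;
        }
        int at = s.indexOf('@');
        // Require exactly one @, with at least one character on each side.
        if (at <= 0 || at != s.lastIndexOf('@') || at == s.length() - 1) {
            return false;
        }
        String localPart = s.substring(0, at);
        String domain = s.substring(at + 1);
        return LOCAL_PART.matcher(localPart).matches()
                && DOMAIN.matcher(domain).matches();
    }

    public static void main(String[] args) {
        System.out.println(looksLikeEmail("joaosilva@gmail.com")); // true
        System.out.println(looksLikeEmail("a@b@c.com"));           // false: more than one @
        System.out.println(looksLikeEmail("@gmail.com"));          // false: nothing before the @
        System.out.println(looksLikeEmail("user@localhost"));      // false: rejected by the domain regex
    }
}
```

With this layout, a new rule for the domain (for example, a size limit) can be added without touching the expression for the first part.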


Just remember that a regex does not check whether the email exists (whether there is an account for that user on that server, whether the account is active, whether someone actually reads the emails sent to this address, etc).
Perhaps in the end it is easier to use a simpler regex (one that checks for something that looks like an email, without worrying about the most complicated rules) and then send an email with a confirmation/activation link, as many websites do.



  • Thank you very much hkotsubo! It helped a lot.

  • @Andréfilipe You’re welcome! If you think the answer actually solves the question, you have the option to accept it: see here how and why to do it. Of course you are not obliged to accept any answer (only if you think it really answered the question), but accepting indicates to future visitors that the answer solved what was asked. And happy studying with regex, because it’s an endless subject!

  • Sorry, I did upvote your answer on Friday, but I closed the browser too quickly. I believe that’s why it wasn’t counted!

  • @Andréfilipe No problem!

  • I have a lot of systems I have to modify after reading that answer.
