An expression between square brackets represents a character class. For example, [abc] means "the letter a, or the letter b, or the letter c" (only one of them).
If you remove the brackets (abc), then it means "the letter a, followed by the letter b, followed by the letter c" (the 3 letters in this order).
And if you have a ^ right after the opening bracket, it means that you are negating what is inside. That is to say, [^abc] means "any character other than the letter a, the letter b, or the letter c".
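To make the difference between the three forms concrete, here is a small sketch. The variable names and test strings are just illustrative choices of mine, not something from the question.

```javascript
// Character class: exactly one character out of a, b or c.
const oneOf = /[abc]/;
// Plain sequence: a, then b, then c, in that order.
const sequence = /abc/;
// Negated class: one character that is not a, b or c.
const noneOf = /[^abc]/;

console.log(oneOf.test("b"));        // true
console.log(sequence.test("b"));     // false
console.log(sequence.test("xabcx")); // true
console.log(noneOf.test("abc"));     // false
console.log(noneOf.test("abcd"));    // true (the "d" matches)
```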
As for \w, it is a shortcut for [a-zA-Z0-9_], i.e., "any uppercase letter, lowercase letter, digit, or the character _".
Therefore, [^\w] means "any character that is not \w", and that is why the expression is also matching the character < (see here an example of this regex working).
And [\w@\w] would be "a \w, or the character @, or a \w". Yes, the \w appears twice inside the brackets, which is redundant.
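A quick way to check both points (that [^\w] also picks up < and that the repeated \w changes nothing) could be something like the sketch below. The sample string is made up for the illustration.

```javascript
// Made-up sample string with word characters and a few symbols.
const text = "a_1 <@>";

// [^\w] matches every character that is not a letter, digit or underscore.
const nonWord = text.match(/[^\w]/g);
console.log(nonWord); // [ ' ', '<', '@', '>' ]

// Repeating \w inside the brackets makes no difference to what is matched.
const withRepeat = text.match(/[\w@\w]/g);
const withoutRepeat = text.match(/[\w@]/g);
console.log(withRepeat.join("") === withoutRepeat.join("")); // true
```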
In your specific case you just want what is between < and >, so you could use just the characters < and > themselves. You don't have to use \w, because you already know exactly which delimiters you have (< and >), so use those characters specifically.
Between them you could use .+, meaning "one or more occurrences of any character". .* would also work, but * means "zero or more occurrences", which would also accept the string <>.
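As a rough illustration of the difference between the two quantifiers, placing each one between < and > and testing against the empty pair could look like this (variable names are mine):

```javascript
const onePlus = /<(.+)>/;  // at least one character between < and >
const zeroPlus = /<(.*)>/; // zero or more characters between < and >

console.log(onePlus.test("<>"));       // false
console.log(zeroPlus.test("<>"));      // true
console.log(zeroPlus.exec("<>")[1]);   // "" (empty capture)
console.log(onePlus.exec("<abc>")[1]); // "abc"
```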
Then the regex would be <(.+)>:
- < and > represent the characters < and > themselves
- .+ is "one or more occurrences of any character"
- the parentheses define a capture group, so that it is possible to obtain the corresponding part with the method match
The code goes like this:
let texto = "teste <[email protected]>";
// match returns an array; index 1 holds the content of the first capture group
let email = texto.match(/<(.+)>/)[1];
console.log(email);
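One detail worth keeping in mind: match returns null when nothing is found, so indexing [1] directly throws an error in that case. A minimal sketch of a safer version, assuming a hypothetical helper name extractBetweenAngles that I made up for this example:

```javascript
// Hypothetical helper: returns the text between < and >, or null when there is no match.
function extractBetweenAngles(text) {
  const match = text.match(/<(.+)>/);
  return match ? match[1] : null;
}

console.log(extractBetweenAngles("teste <abc>"));   // "abc"
console.log(extractBetweenAngles("no delimiters")); // null
```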
Greediness
The problem with the quantifiers + and * is that they are greedy and try to match as many characters as possible. In your specific case it makes no difference, but if you have more than one pair of < and >, the result may be unexpected.
With a text that has two pairs, such as <[email protected]> outro <[email protected]>, the captured part will be [email protected]> outro <[email protected]: the quantifier + takes as many characters as possible while still satisfying the regex. As I used .+, it will take as much as possible of any character, until it finds a > (it finds the first one, but since there is another one later, it keeps going). So it ends up taking more than it "should".
In this case, it is enough to cancel the greediness by putting a ? right after the quantifier, making the regex <(.+?)>. With this, the captured email becomes only "[email protected]".
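The greedy and lazy behaviours can be compared side by side with a sketch like the one below. The addresses a@example.com and b@example.com are placeholders I invented for the example, and the matchAll part is an extra that goes slightly beyond the original code.

```javascript
// Placeholder addresses, made up for this illustration.
const sample = "teste <a@example.com> outro <b@example.com>";

// Greedy: .+ runs up to the last > it can find.
const greedy = sample.match(/<(.+)>/)[1];
// Lazy: .+? stops at the first > that lets the regex match.
const lazy = sample.match(/<(.+?)>/)[1];

console.log(greedy); // "a@example.com> outro <b@example.com"
console.log(lazy);   // "a@example.com"

// With the g flag, matchAll yields one match per pair.
const all = [...sample.matchAll(/<(.+?)>/g)].map((m) => m[1]);
console.log(all); // [ 'a@example.com', 'b@example.com' ]
```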
About using regex to validate emails
But if between < and > there will always be an email, maybe the regex should be more specific, so you avoid "false positives" (getting a string that is between < and > but is not an email). Only then it starts to get too complicated (see this example, just to get an idea). But at least you guarantee that you will only have valid data (avoiding strings like "<123>", assuming you will only accept email addresses).
You just need to see if it's worth having such a complex regex (think about the future maintenance of this code). Maybe an alternative is to have something in between, like <([^<>]+)> (a <, followed by one or more occurrences of anything that is neither < nor >, followed by >).
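A short sketch of how this intermediate regex behaves, again with placeholder addresses that I made up. Since the negated class cannot go past a >, it stops at the first pair even though the quantifier is greedy.

```javascript
const delimited = /<([^<>]+)>/;
// Placeholder addresses, made up for this illustration.
const input = "teste <a@example.com> outro <b@example.com>";

console.log(input.match(delimited)[1]);   // "a@example.com"
console.log(delimited.test("<>"));        // false (needs at least one character)
console.log("<123>".match(delimited)[1]); // "123" (still accepted, it is not an email check)
```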
Or even a more "naive" regex, like <([\w.-]+@[a-z.]+)> (one or more occurrences of \w, dot or hyphen, followed by @, followed by one or more occurrences of lowercase letters or dots). It accepts email addresses, but also accepts things like <.@gmail.>. But at least it won't accept <123> (which is something that .+ would accept, for example).
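The trade-off of the "naive" version can be checked with a few inputs, as in the sketch below. The address user@example.com is a placeholder of mine; the other strings are the ones mentioned above.

```javascript
const naive = /<([\w.-]+@[a-z.]+)>/;

console.log(naive.test("<user@example.com>")); // true
console.log(naive.test("<.@gmail.>"));         // true (a false positive)
console.log(naive.test("<123>"));              // false
console.log(/<(.+)>/.test("<123>"));           // true (the generic regex accepts it)
```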
Anyway, there is no way around it: the more specific the regex, the lower the chance of false positives, but the greater the complexity.
PS: I also comment more on the use of regex to validate emails in this answer, and in this other one as well.
After reading this, I came to the conclusion that I know nothing about regex. :)
– Sam
@Sam The more I study regex, the more I come to the same conclusion as you (that I know nothing). It seems an endless subject...
– hkotsubo
Yeah, man. It looks like a programming language of its own.
– Sam