Any tips on how to do it in a minimally performant way?
Whenever you want something "performant", regex is usually not among the first options (for a number of reasons, such as the engine having to compile the expression, plus various other internal details that add considerable overhead).
Of course, for small strings checked only a few times, the difference will be imperceptible (after all, with little data everything is fast), but since the "need" for performance was mentioned, I think it is worth giving alternatives (with and without regex).
The alternatives below assume that the format is always as described: a set of characters always between < and >, possibly with other characters before (and perhaps after?), and at most one occurrence each of < and > (because I also handle the cases where they are absent).
Regex-free
A simple way is to use indexOf to get the indexes of < and >, and then take the stretch between them with substring:
function getEmail(s) {
    // find the position of '<'
    var inicio = s.indexOf('<');
    if (inicio === -1) // not there, so return right away
        return undefined;
    // find the position of '>', starting the search from the position of '<'
    var fim = s.indexOf('>', inicio);
    if (fim === -1) // not there, so return right away
        return undefined;
    // return the stretch between '<' and '>'
    return s.substring(inicio + 1, fim);
}
const strs = [ 'foo <[email protected]>', 'bar <[email protected]>', 'baz <[email protected]> fdafad fad', 'invalid' ];
const emails = strs.map(getEmail);
console.log(emails); // [ '[email protected]', '[email protected]', '[email protected]', undefined ]
Or, if you want to do everything "by hand", a simple loop is enough:
function getEmail(s) {
    var inicio = null, fim = null;
    for (var i = 0; i < s.length; i++) {
        var c = s[i];
        if (c === '<') {
            inicio = i;
        } else if (c === '>' && inicio !== null) {
            fim = i;
            break;
        }
    }
    if (inicio !== null && fim !== null)
        return s.substring(inicio + 1, fim);
    return undefined;
}
But I think the version with indexOf is simpler (and also faster, as we will see at the end).
Regex
If you really want to stick with regex, you can improve it a little (but not much) by changing the expression to /<([^>]+)>/.
The change is in the middle of the expression: instead of .* (meaning "zero or more characters"), I used [^>]+:
- [^>] is a negated character class, meaning "any character that is not >"
- the quantifier + means "one or more" (in this context I think it is better than *, which means "zero or more" and would therefore also accept cases like <>, whereas + requires at least one character between < and >)
The difference occurs because the quantifiers * and + are "greedy" and try to match as many characters as possible. Since the dot matches any character, .* goes all the way to the end of the string (i.e., it also consumes the >), and when no match is found there, the engine backtracks: it gives characters back one at a time until it finds a match (here is a more detailed explanation of this mechanism).
The [^>]+, on the other hand, is one or more characters that are not >, which ensures that it stops right before the > instead of going to the end of the string. This gives a gain that may or may not make a difference (in your case, with a few small strings, not so much; compare here and here).
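To make the difference concrete, here is a minimal sketch that runs both expressions on the same input. The function names and the sample address are made up for illustration, and the sample deliberately contains a second > (which goes against the "at most one occurrence" assumption above) only to make the greedy behavior visible:

```javascript
// Hypothetical helper using the original greedy expression.
function getEmailGreedy(s) {
  const match = s.match(/<(.*)>/);
  return match ? match[1] : undefined;
}

// Hypothetical helper using the negated character class.
function getEmailNegated(s) {
  const match = s.match(/<([^>]+)>/);
  return match ? match[1] : undefined;
}

// Made-up sample with an extra '>' after the address.
const sample = 'baz <user@example.com> extra > text';

// .* runs to the end of the string and backtracks to the LAST '>'
console.log(getEmailGreedy(sample));  // 'user@example.com> extra '

// [^>]+ stops at the FIRST '>'
console.log(getEmailNegated(sample)); // 'user@example.com'

// + requires at least one character, so '<>' does not match
console.log(getEmailNegated('empty <>')); // undefined
```

With well-formed input (a single < and >) both expressions return the same text; the only difference left is how much work the engine does to get there.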
Comparison
I ran a test comparing the regex above, yours, and the solutions with indexOf and the manual loop over the string (it is not 100% accurate, but it gives an idea).
Results vary from one run to another, but generally speaking, the regex solutions are slower (and the one using .* turns out to be the slowest most of the time).
Anyway, I also ran the test on Node (with the same code from the link already mentioned), using Benchmark.js, and the results were similar:
regex1 x 2,676,016 ops/sec ±1.09% (89 runs sampled)
regex2 x 2,893,208 ops/sec ±1.67% (84 runs sampled)
for x 3,191,144 ops/sec ±1.52% (83 runs sampled)
indexOf x 5,174,035 ops/sec ±1.17% (90 runs sampled)
Fastest is indexOf
The value to consider is "ops/sec" (operations per second: the higher, the better, although of course you also have to take the margin of error into account). In the end, indexOf is the fastest, and the two regexes were the slowest (with a slight advantage for the one that uses [^>]+).
And again: for a few small strings, the difference will be irrelevant. But since performance was mentioned in the question, I thought it was worth showing alternatives and pointing out the performance differences between them.
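If you want a quick feel for the difference on your own machine without installing Benchmark.js, a rough timing loop can be sketched as follows. This is not the benchmark used for the numbers above: the helper names, the sample inputs and the number of rounds are all arbitrary choices, it assumes an environment where performance.now() is available as a global (recent Node versions and browsers), and a plain loop like this is far less rigorous than a proper benchmarking library.

```javascript
// Hypothetical helper: same logic as the indexOf version above.
function viaIndexOf(s) {
  const start = s.indexOf('<');
  if (start === -1) return undefined;
  const end = s.indexOf('>', start);
  if (end === -1) return undefined;
  return s.substring(start + 1, end);
}

// Hypothetical helper: regex with the negated character class.
function viaRegex(s) {
  const match = s.match(/<([^>]+)>/);
  return match ? match[1] : undefined;
}

// Calls fn on every string in `strings`, `rounds` times,
// and returns the elapsed time in milliseconds.
function timeIt(fn, strings, rounds) {
  const t0 = performance.now();
  for (let r = 0; r < rounds; r++) {
    for (const s of strings) fn(s);
  }
  return performance.now() - t0;
}

// Made-up inputs in the same shape as the examples above.
const inputs = ['foo <a@example.com>', 'baz <b@example.com> trailing', 'invalid'];

for (const [name, fn] of [['indexOf', viaIndexOf], ['regex', viaRegex]]) {
  // Timings vary by machine and from run to run.
  console.log(name, timeIt(fn, inputs, 100000).toFixed(1) + ' ms');
}
```

Here lower is better (elapsed time), unlike the ops/sec figures above, and a single run is only a rough indication.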
Another option is to switch the regex to /<([^>]+)>/ - the [^>] (any character other than >) ensures that it stops when it finds the >, which does not happen with the dot, because the dot matches any character, including >, so it first goes to the end of the string and then backtracks until it finds the > (granted, in this case the > is at the end, so it won't be much worse, but it still gets a "little" faster: https://jsbench.me/xeksi6olml/1) – hkotsubo