Any tips on how to do this in a minimally performant way?
Whenever you want something "performant", regex is usually not among the first options (for a number of reasons, such as the engine needing to compile the expression, plus various other internal details that add considerable overhead).
Of course, for small strings checked only a few times the difference will be imperceptible (after all, with little data everything is fast), but since the "need" for performance was mentioned, I think it is worth showing alternatives (with and without regex).
The alternatives below assume that the format is always as described: a set of characters between < and >, possibly with other characters before (and perhaps after?), and at most one occurrence each of < and > (I also handle the cases where they are absent).
Without regex
A simple way is to use indexOf to get the positions of < and >, and then take the stretch between them with substring:
function getEmail(s) {
    // find the position of '<'
    var inicio = s.indexOf('<');
    if (inicio === -1) // not found, return right away
        return undefined;
    // find the position of '>', starting the search from the position of '<'
    var fim = s.indexOf('>', inicio);
    if (fim === -1) // not found, return right away
        return undefined;
    // return the stretch between '<' and '>'
    return s.substring(inicio + 1, fim);
}
const strs = [ 'foo <[email protected]>', 'bar <[email protected]>', 'baz <[email protected]> fdafad fad', 'invalid' ];
const emails = strs.map(getEmail);
console.log(emails); // [ '[email protected]', '[email protected]', '[email protected]', undefined ]
Or, if you want to do everything "by hand", a simple loop is enough:
function getEmail(s) {
    var inicio = null, fim = null;
    // walk through the string one character at a time
    for (var i = 0; i < s.length; i++) {
        var c = s[i];
        if (c === '<') {
            inicio = i;
        } else if (c === '>' && inicio !== null) {
            // only accept '>' if a '<' was already seen
            fim = i;
            break;
        }
    }
    if (inicio !== null && fim !== null)
        return s.substring(inicio + 1, fim);
    return undefined;
}
But I think the version with indexOf is simpler (and also faster, as we will see at the end).
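Since both versions rely on the format assumptions stated earlier, here is a small sketch of how they behave on inputs outside that format. The names getEmailIndexOf and getEmailLoop are hypothetical, introduced only so that both functions can live in the same snippet, and the input strings are made-up examples.

```javascript
// indexOf-based version (hypothetical name, same logic as shown above)
function getEmailIndexOf(s) {
    var inicio = s.indexOf('<');
    if (inicio === -1) return undefined;
    var fim = s.indexOf('>', inicio);
    if (fim === -1) return undefined;
    return s.substring(inicio + 1, fim);
}

// manual-loop version (hypothetical name, same logic as shown above)
function getEmailLoop(s) {
    var inicio = null, fim = null;
    for (var i = 0; i < s.length; i++) {
        var c = s[i];
        if (c === '<') {
            inicio = i;
        } else if (c === '>' && inicio !== null) {
            fim = i;
            break;
        }
    }
    if (inicio !== null && fim !== null)
        return s.substring(inicio + 1, fim);
    return undefined;
}

// missing delimiters: both return undefined
console.log(getEmailIndexOf('only <open'), getEmailLoop('only <open'));
console.log(getEmailIndexOf('> before <'), getEmailLoop('> before <'));

// empty brackets: both return an empty string
console.log(getEmailIndexOf('<>') === '', getEmailLoop('<>') === ''); // true true

// two '<' before the '>': the versions disagree
console.log(getEmailIndexOf('<a <b>')); // 'a <b' (starts at the first '<')
console.log(getEmailLoop('<a <b>'));    // 'b'    (keeps the last '<' seen)
```

The last case only matters if the input can contain more than one <, which the assumptions above rule out; it is shown just to make clear that the two versions are equivalent only within that format.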
Regex
If you really want to stick with regex, you can improve it a little (but not by much), for example with /<([^>]+)>/.
The change is in the middle of the expression: instead of .* (meaning "zero or more characters"), I used [^>]+:
- [^>] is a negated character class, meaning "any character that is not >"
- the quantifier + means "one or more" (in this context I think it is better than *, which means "zero or more" and therefore also accepts cases like <>, whereas + requires at least one character between < and >)
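The effect of choosing + over * can be checked with a tiny sketch. The variable names star and plus are made up for this illustration, and the input strings are made-up examples.

```javascript
// '*' accepts zero characters between the brackets, '+' requires at least one
const star = /<([^>]*)>/;
const plus = /<([^>]+)>/;

// empty brackets: '*' matches with an empty group, '+' does not match at all
console.log(star.exec('<>')[1] === ''); // true
console.log(plus.exec('<>'));           // null

// non-empty content: both capture the same text
console.log(star.exec('x <abc>')[1]); // 'abc'
console.log(plus.exec('x <abc>')[1]); // 'abc'
```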
The difference between .* and [^>]+ arises because the quantifiers * and + are "greedy" and try to grab as many characters as possible. Since the dot matches any character, .* goes all the way to the end of the string (i.e., it also consumes the >), and when no match is found from there, the engine backtracks: it gives characters back one at a time until it finds a match (here there is a more detailed explanation of this mechanism).
In contrast, [^>]+ is one or more characters that are not >, which guarantees that it stops before the > instead of going to the end of the string. This gives a gain that may or may not make a difference (in your case, with a few small strings, it is not much; compare here and here).
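As a rough sketch of the two regex variants side by side: the question's regex is not reproduced in this answer, so /<(.*)>/ is an assumption here, and the function names getEmailGreedy and getEmailNegated, as well as the sample strings, are hypothetical.

```javascript
// assumed original: greedy dot (the question's exact regex is not shown here)
function getEmailGreedy(s) {
    const m = /<(.*)>/.exec(s);
    return m ? m[1] : undefined;
}

// suggested variant: negated character class
function getEmailNegated(s) {
    const m = /<([^>]+)>/.exec(s);
    return m ? m[1] : undefined;
}

// with a single pair of brackets both give the same result
const single = 'foo <user@example.com> trailing text';
console.log(getEmailGreedy(single));  // 'user@example.com'
console.log(getEmailNegated(single)); // 'user@example.com'

// with two pairs the greedy dot runs to the last '>'
const twice = 'a <x> b <y>';
console.log(getEmailGreedy(twice));  // 'x> b <y'
console.log(getEmailNegated(twice)); // 'x'

// no brackets at all: both return undefined
console.log(getEmailGreedy('invalid'), getEmailNegated('invalid'));
```

The second input falls outside the "at most one pair" assumption, but it makes the greedy behavior visible: the result differs only because .* consumed everything up to the last >.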
Comparison
I ran a test comparing the regex above, yours, and the solutions with indexOf and with the manual loop over the string (it is not 100% accurate, but it gives an idea).
Results vary from one run to another, but generally speaking the regex solutions are slower (and the one using .* turns out to be the slowest most of the time).
Anyway, I also ran the test on Node (with the same code from the link already mentioned), using Benchmark.js, and the results were similar:
regex1 x 2,676,016 ops/sec ±1.09% (89 runs sampled)
regex2 x 2,893,208 ops/sec ±1.67% (84 runs sampled)
for x 3,191,144 ops/sec ±1.52% (83 runs sampled)
indexOf x 5,174,035 ops/sec ±1.17% (90 runs sampled)
Fastest is indexOf
The value to look at is "ops/sec" (operations per second: the higher, the better, although of course you also have to take the margin of error into account). In the end, indexOf is the fastest and the two regexes are the slowest (with a slight advantage for the one that uses [^>]+).
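For anyone who wants a quick sanity check without installing anything, a hand-rolled timing loop can give a rough idea. This is not the Benchmark.js setup used for the numbers above, just a crude sketch: the helper timeIt, the function names viaIndexOf and viaRegex, the sample inputs, and the iteration count are all assumptions, and a simple loop like this is far less reliable than a proper benchmarking library.

```javascript
// indexOf-based extraction (hypothetical name)
function viaIndexOf(s) {
    var inicio = s.indexOf('<');
    if (inicio === -1) return undefined;
    var fim = s.indexOf('>', inicio);
    if (fim === -1) return undefined;
    return s.substring(inicio + 1, fim);
}

// regex-based extraction with the negated character class (hypothetical name)
function viaRegex(s) {
    const m = /<([^>]+)>/.exec(s);
    return m ? m[1] : undefined;
}

// runs fn over all inputs `iterations` times and returns elapsed milliseconds
function timeIt(fn, inputs, iterations) {
    const start = Date.now();
    for (let i = 0; i < iterations; i++) {
        for (const s of inputs) fn(s);
    }
    return Date.now() - start;
}

const sampleInputs = ['foo <a@example.com>', 'bar <b@example.com> extra', 'invalid'];
const timings = {
    indexOf: timeIt(viaIndexOf, sampleInputs, 100000),
    regex: timeIt(viaRegex, sampleInputs, 100000),
};
console.log(timings); // elapsed ms per approach; values vary by machine and run
```

Running it several times and comparing the trend is more meaningful than any single number.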
And again: for a few small strings, the difference will be irrelevant. But since the performance was something mentioned in the question, I thought it was worth showing alternatives and pointing out the differences in performance between them.
Another option is to switch the regex to /<([^>]+)>/ - the [^>] (any character other than >) ensures that it stops when it finds the >, which does not happen with the dot, because the dot matches any character, including >, so it first goes to the end of the string and then goes back (backtracking) until it finds the > (granted, in this case the > is at the end, so it will not be that much worse, but it still gets a "little" faster: https://jsbench.me/xeksi6olml/1) – hkotsubo