How to extract a substring from each string in an array with JavaScript using match?


Asked 4 years, 7 months ago

I need to process data in the following format: ["name <that name's email>"], that is, an array of several strings, all in the same format. I want to extract only the emails, without the <>, into another array.

I tried converting it to a plain array and using JavaScript's String.match() with the regexp \<(.*)\>, but it returns another array containing other data that is useless to me.

Any tips on how to do this in a reasonably performant way?

2 answers

The result you reported is expected. You wrote:

I tried converting it to a plain array and using JavaScript's String.match() with the regexp <(.*)>, but it returns another array containing other data that is useless to me.

See the documentation of match. It does return an array with several pieces of information, and one of them is exactly what you need: the value captured by the group in the regex.

Therefore, you need to extract the capture group's value from the array returned by match. Since you have only one capture group, it is the second element of the returned array (the first element is the full text matched by the expression).

A simple example:

const str = 'John Doe <[email protected]>';
const match = str.match(/<(.*)>/);
console.log(match[1]); // only the e-mail (captured by the first and only group)
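
To make the shape of the returned array more concrete, here is a small sketch (the sample string and address are made up for illustration). Without the g flag, match returns an array whose element 0 is the full matched text, followed by one element per capture group, plus extra properties such as index:

```javascript
// Hypothetical sample string, for illustration only.
const sample = 'John Doe <john@example.com>';
const result = sample.match(/<(.*)>/);

console.log(result[0]);     // '<john@example.com>' (the whole matched text)
console.log(result[1]);     // 'john@example.com' (the first capture group)
console.log(result.index);  // 9 (position where the match starts)
console.log(result.length); // 2 (full match + one capture group)
```

This is why index 1, and not index 0, holds the e-mail without the angle brackets.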

For an array of strings, where match has to be called on each one, you can use map:

const strs = [
  'foo <[email protected]>',
  'bar <[email protected]>',
  'baz <[email protected]>',
  'invalid'
];

const emails = strs.map((str) => {
  return str.match(/<(.*)>/)?.[1];
});
console.log(emails);

What happens above is simple: each string in the list is mapped to the second element of the array returned by match. See the documentation of map for more details.

Note that when the string does not match the regular expression, match returns null (as happens with the last string of the array above). In that situation, JavaScript would throw an error when we tried to access index 1 of null. To avoid this, I used optional chaining, which yields undefined in such cases. Of course, you can handle these situations differently.

Another option, without regular expressions, is to find the index of the first < and return the substring that follows it, which must end with a > as the last character of the string:
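
As a sketch of what handling these situations differently could look like, the snippet below shows two possible variations (the sample data and variable names are hypothetical): replacing undefined with an explicit fallback, or dropping the non-matching entries altogether.

```javascript
// Hypothetical sample data, for illustration only.
const entries = [
  'foo <foo@example.com>',
  'invalid',
  'bar <bar@example.com>'
];

// Variation 1: keep the positions, but use an explicit fallback value.
const withFallback = entries.map((s) => s.match(/<(.*)>/)?.[1] ?? null);
console.log(withFallback); // [ 'foo@example.com', null, 'bar@example.com' ]

// Variation 2: drop the entries that do not match at all.
const onlyValid = entries.flatMap((s) => {
  const m = s.match(/<(.*)>/);
  return m ? [m[1]] : [];
});
console.log(onlyValid); // [ 'foo@example.com', 'bar@example.com' ]
```

Which variation fits best depends on whether the caller needs the result to stay aligned with the input array.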

function getEmail(str) {
  const start = str.indexOf('<');
  if (start === -1) return undefined;

  const end = str.length - 1;
  if (str[end] !== '>') return undefined;

  return str.substring(start + 1, end);
}

const strs = [
  'foo <[email protected]>',
  'bar <[email protected]>',
  'baz <[email protected]>',
  'invalid'
];

const emails = strs.map(getEmail);
console.log(emails);

Since it does not use a regular expression, this last version should be considerably faster. However, it loses in expressiveness and in implementation size. In this case, the regular expression seems simpler to me.

There is also the question of whether the e-mail is valid, but that is another matter and, spoiler, a far more complex one.
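
One behavioural difference between the two approaches is worth seeing side by side: the indexOf-based version only accepts strings whose last character is >, while the regex also accepts text after the closing bracket. A small sketch, with made-up sample data and the helper renamed to getEmailStrict (a hypothetical name) so it does not clash with the code above:

```javascript
// Same logic as the indexOf-based function above, under a hypothetical name.
function getEmailStrict(str) {
  const start = str.indexOf('<');
  if (start === -1) return undefined;

  const end = str.length - 1;
  if (str[end] !== '>') return undefined; // '>' must be the last character

  return str.substring(start + 1, end);
}

const withRegex = (str) => str.match(/<(.*)>/)?.[1];

// Hypothetical input with text after the closing bracket.
const trailing = 'baz <baz@example.com> extra text';
console.log(getEmailStrict(trailing)); // undefined
console.log(withRegex(trailing));      // 'baz@example.com'
```

If trailing text can occur in the real data, the two versions are not interchangeable.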

Another option is to change the regex to /<([^>]+)>/. The [^>] (any character other than >) ensures that it stops when it finds the >. That does not happen with the dot, which matches any character, including the >: it first goes to the end of the string and then backtracks until it finds the >. Granted, in this case the > is at the end, so it will not be much worse, but it still gets a "little" faster: https://jsbench.me/xeksi6olml/1

– hkotsubo

2021/08/19 at 00:34

by hkotsubo • **55,826** points · Answer 1 · 2021-08-19T14:06:23+00:00

Any tips on how to do this in a reasonably performant way?

Whenever you want something "performant", regex is usually not among the first options, for a number of reasons: the engine needs to compile the expression, and various other internal details add considerable overhead.

Of course, for small strings being checked a few times, the difference will be imperceptible (after all, for few data, everything is fast), but since the "need" for performance has been mentioned, I think it is worth giving alternatives (with and without regex).

The alternatives below assume that the format is always as described: a run of characters between < and >, possibly with other characters before (and perhaps after?), and at most one occurrence each of < and > (at most, because I also handle the cases where they are absent).

Regex-free

A simple way is to use indexOf to find the positions of < and >, and then take the stretch between them with substring:

function getEmail(s) {
    // find the position of '<'
    var inicio = s.indexOf('<');
    if (inicio === -1) // if there is none, return right away
        return undefined;
    // find the position of '>', starting the search at the position of '<'
    var fim = s.indexOf('>', inicio);
    if (fim === -1) // if there is none, return right away
        return undefined;
    // return the stretch between '<' and '>'
    return s.substring(inicio + 1, fim);
}

const strs = [ 'foo <[email protected]>', 'bar <[email protected]>', 'baz <[email protected]> fdafad fad', 'invalid' ];
const emails = strs.map(getEmail);
console.log(emails); // [ '[email protected]', '[email protected]', '[email protected]', undefined ]

Or, if you want to do everything "by hand", a simple loop is enough:

function getEmail(s) {
    var inicio = null, fim = null;
    for (var i = 0; i < s.length; i++) {
        var c = s[i];
        if (c === '<') {
            inicio = i;
        } else if (c === '>' && inicio !== null) {
            fim = i;
            break;
        }
    }
    if (inicio !== null && fim !== null)
        return s.substring(inicio + 1, fim);
    return undefined;
}

But I think the version with indexOf is simpler (and also faster, as we will see at the end).

Regex

If you really want to stick with regex, you can improve it a little (but not by much):

const strs = [ 'foo <[email protected]>', 'bar <[email protected]>', 'baz <[email protected]> fdafad fad', 'invalid' ];
const emails = strs.map(str => str.match(/<([^>]+)>/)?.[1]);
console.log(emails); // [ '[email protected]', '[email protected]', '[email protected]', undefined ]

The change is in the core of the expression: instead of .* (meaning "zero or more characters"), I used [^>]+:

the [^>] is a negated character class, meaning "any character that is not >"
the quantifier + means "one or more" (in this context I think it is better than *, which means "zero or more" and would therefore also accept cases like <>, whereas + requires at least one character between < and >)
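
A minimal sketch of the difference between the two quantifiers on an empty pair of brackets (the input string is made up for illustration):

```javascript
// Hypothetical input with nothing between the brackets.
const empty = 'nobody <>';

// With *, "<>" is accepted and the captured value is an empty string.
console.log(empty.match(/<([^>]*)>/)?.[1]); // ''

// With +, at least one character is required, so there is no match at all.
console.log(empty.match(/<([^>]+)>/)?.[1]); // undefined
```

Note that the empty string is falsy, so with * a later check such as if (email) would treat both results the same way, while a strict comparison against undefined would not.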

The difference arises because the quantifiers * and + are "greedy": they try to consume as many characters as possible. Here, since the dot matches any character, .* goes to the end of the string (that is, it also consumes the >) and, finding no match there, backtracks: it gives characters back one at a time until a match is found.

By contrast, [^>]+ is one or more characters that are not >, which ensures that it stops before the > instead of going to the end of the string. This gives a gain that may or may not matter (in your case, with a few small strings, it is not much).
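
The greedy behaviour is not only a performance detail. As a sketch, consider an input that falls outside the format assumed above because it has more than one pair of brackets (the sample string is made up for illustration):

```javascript
// Hypothetical input with two bracketed addresses in the same string.
const text = 'foo <foo@example.com> and bar <bar@example.com>';

// Greedy .* runs to the end and backtracks only to the LAST '>'.
console.log(text.match(/<(.*)>/)[1]);
// 'foo@example.com> and bar <bar@example.com'

// [^>]+ cannot cross a '>', so it stops at the first one.
console.log(text.match(/<([^>]+)>/)[1]);
// 'foo@example.com'
```

With the single-pair format from the question both expressions capture the same value, so this only matters if the input can deviate from it.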

Comparison

I ran a test comparing the regex above, yours, and the solutions with indexOf and the manual loop over the string (it is not 100% accurate, but it gives an idea).

Results vary from one run to another, but generally speaking the regex solutions are slower (and the one using .* turns out slower most of the time).

I also ran the test on Node (with the same code as in the test mentioned above), using Benchmark.js, and the results were similar:

regex1 x 2,676,016 ops/sec ±1.09% (89 runs sampled)
regex2 x 2,893,208 ops/sec ±1.67% (84 runs sampled)
for x 3,191,144 ops/sec ±1.52% (83 runs sampled)
indexOf x 5,174,035 ops/sec ±1.17% (90 runs sampled)
Fastest is indexOf

The value to look at is "ops/sec" (operations per second: the higher, the better, although you also have to consider the margin of error). In the end, indexOf was the fastest and the two regexes were the slowest (with a slight advantage for the one that uses [^>]+).

And again: for a few small strings, the difference will be irrelevant. But since performance was mentioned in the question, I thought it was worth showing alternatives and pointing out the performance differences between them.
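
For a quick check without any library, a rough timing loop can give a first impression. The sketch below is not the Benchmark.js code used for the numbers above; the names (inputs, candidates, time) and the sample data are hypothetical, Date.now() only has millisecond resolution, and the timings will vary by machine and engine.

```javascript
// Hypothetical sample data, for illustration only.
const inputs = [
  'foo <foo@example.com>',
  'bar <bar@example.com>',
  'baz <baz@example.com> trailing text',
  'invalid'
];

// The three strategies discussed above, keyed by a descriptive name.
const candidates = {
  regexDotStar: (s) => s.match(/<(.*)>/)?.[1],
  regexNegated: (s) => s.match(/<([^>]+)>/)?.[1],
  indexOf: (s) => {
    const start = s.indexOf('<');
    if (start === -1) return undefined;
    const end = s.indexOf('>', start);
    if (end === -1) return undefined;
    return s.substring(start + 1, end);
  }
};

// Runs fn over all inputs `rounds` times and returns the elapsed milliseconds.
function time(fn, rounds = 100000) {
  let sink = 0; // accumulate a value so the results are actually used
  const t0 = Date.now();
  for (let r = 0; r < rounds; r++) {
    for (const s of inputs) {
      const email = fn(s);
      if (email !== undefined) sink += email.length;
    }
  }
  return { ms: Date.now() - t0, sink };
}

for (const [name, fn] of Object.entries(candidates)) {
  console.log(name, time(fn).ms, 'ms');
}
```

A dedicated tool such as Benchmark.js remains the better choice for real measurements, since it handles warm-up, sampling, and the margin of error.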