What is RegExp Match Indices?

I recently found this package, which serves as a polyfill for a new JavaScript feature. The RegExp Match Indices proposal was finalized recently, which means it will soon be part of the language.

The polyfill's npm page says the following:

The implementation is a replacement for RegExp.prototype.exec that approximates the behavior of the proposal. Because RegExp.prototype.exec depends on a receiver (the value of this), the main export accepts the RegExp to operate on as the first argument.

It even shows example code (polyfill):

const execWithIndices = require('regexp-match-indices');

const text = 'zabbcdef';
const re = new RegExp('ab*(cd(?<Z>ef)?)');
const result = execWithIndices(re, text);
console.log(result.indices); // [[1, 8], [4, 8], [6, 8]]

The result above (console.log) is very confusing to me, since I don't have a solid knowledge of regular expressions.

Things got more confusing when I read the proposal document on GitHub and analyzed the code that demonstrates the actual feature:

const re1 = /a+(?<Z>z)?/d;

// indices are relative to start of the input string:
const s1 = 'xaaaz';
const m1 = re1.exec(s1);
m1.indices[0][0] === 1;
m1.indices[0][1] === 5;
s1.slice(...m1.indices[0]) === 'aaaz';

m1.indices[1][0] === 4;
m1.indices[1][1] === 5;
s1.slice(...m1.indices[1]) === 'z';

m1.indices.groups['Z'][0] === 4;
m1.indices.groups['Z'][1] === 5;
s1.slice(...m1.indices.groups['Z']) === 'z';

// capture groups that are not matched return `undefined`:
const m2 = re1.exec('xaaay');
m2.indices[1] === undefined;
m2.indices.groups['Z'] === undefined;

The official feature (code above) uses a d flag, but this flag is not yet included in Mozilla's documentation (MDN).

(1) From what I understand, this d flag creates an indices property that holds the capture indices, but I'm not sure. For example, without the d flag, we get this object:

const re1 = /a+(?<Z>z)?/;

const s1 = "xaaaz";
const m1 = re1.exec(s1)

console.log(m1)

// output
// (2) ["aaaz", "z", index: 1, input: "xaaaz", groups: {…}]
//  0: "aaaz"
//  1: "z"
//  groups: {Z: "z"}
//  index: 1
//  input: "xaaaz"
//  length: 2
//  __proto__: Array(0)

The above output does not appear in the SO snippet.

With the npm package, I ran a script similar to the one in the documentation (with the polyfill, of course) to analyze and compare the behavior of the d flag. It returns "basically" the same thing:

const execWithIndices = require("regexp-match-indices");

const text = "xaaaz";
const re = new RegExp("a+(?<Z>z)?");
const result = execWithIndices(re, text);

console.log(result)

// output
// [
//   'aaaz',
//   'z',
//   index: 1,
//   input: 'xaaaz',
//   groups: [Object: null prototype] { Z: 'z' },
//   indices: [Getter/Setter]
// ]

Only now it comes with this indices property, which represents where each capture group matched (I think), hence my statement (1).

I would like to know more about the purpose of this new d flag and of this feature, with an explanation of a use case and, preferably, an explanation of how it is used in the code above.

(?<Z>z) is a capture group, right?

    Yes, (?<Z>z) is a named capture group. Although the syntax may seem "difficult", the <Z> only indicates the name of the group, which in this case is Z. :)

1 answer


The idea of the proposal is to return the start and end indices of the match found, and also of the capture groups, when present.

Before this proposal, with RegExp.prototype.exec, String.prototype.match, or String.prototype.matchAll, the most we had was the start index at which the match was found. That is, in this code:

const s1 = 'zabbcdef';
const m1 = s1.match(/ab*(cd(?<Z>ef)?)/);
for (const e in m1) {
    console.log(e, m1[e]);
}

The result (the array m1) has an index property, indicating the index at which the match starts (in this case it is 1, the position in the string where the a is). It also has the groups property, which contains the named groups (in this case the only named group is (?<Z>ef), indicating that the content ef belongs to the group whose name is "Z").

In the array m1 the contents of the capture groups are also returned (cdef and ef), but there is no information about their positions.
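
To illustrate that limitation: without the d flag, the end of the whole match can still be derived from index plus the length of the matched text, but for a capture group all you can do is guess, for example with indexOf, and that can point at the wrong occurrence. A minimal sketch (the string and regex here are made up just for this illustration):

```javascript
const s = 'ab-ab';
const m = s.match(/-(ab)/); // no d flag

// Whole match: the start is given, the end can be computed.
const wholeStart = m.index;             // 2
const wholeEnd = m.index + m[0].length; // 5

// Group 1: only its text ('ab') is available, not its position.
// Searching for that text finds the first 'ab', not the captured one.
const guessedStart = s.indexOf(m[1]);   // 0, but the group really starts at 3

console.log(wholeStart, wholeEnd, guessedStart);
```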


The idea of the proposal is to have the start and end indices of the whole match and also of each capture group. In the case of the above regex, we have 2 groups:

  • (?<Z>ef) is a named group (its name is "Z", the content is ef)
  • (cd(?<Z>ef)?) is a group without a name; its content is cd followed by the contents of the "Z" group (and the entire "Z" group is optional, as it has the ? right after it)

In this case, the groups are "numbered" in the order they appear: the group containing cd etc. is the first, and the named group is the second.

Finally, the indices returned in the polyfill example are:

  • [ 1, 8 ]: where the whole match is: 1 is the position of the a that starts the match, and 8 is the position right after where the match ends (the match ends at the f, which is at index 7)
  • [ 4, 8 ]: the unnamed group starts at index 4, which is where the c is in the string, i.e. where the sub-match for this group begins
  • [ 6, 8 ]: 6 is the index of the e, which is where the sub-match for the "Z" group begins.

And when there is a named group, its indices are also placed in indices.groups, an object in which the keys are the names of the groups and the values are the respective indices.
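
A small sketch of reading indices.groups, assuming an engine that already supports the d flag natively (with the polyfill you would call execWithIndices instead):

```javascript
const text = 'zabbcdef';
const match = /ab*(cd(?<Z>ef)?)/d.exec(text);

// indices.groups.Z holds the same [start, end] pair as indices[2]
const [start, end] = match.indices.groups.Z;

console.log(start, end);             // 6 8
console.log(text.slice(start, end)); // ef
```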


Since the named group is optional (indicated by ?), if the string were zabbcd123, the last pair of indices ([6, 8]) would not be returned (undefined is put in its place).
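
The zabbcd123 case can be checked like this (again a sketch that assumes native support for the d flag):

```javascript
const m = /ab*(cd(?<Z>ef)?)/d.exec('zabbcd123');

console.log(m.indices[0]);       // [ 1, 6 ]  whole match: abbcd
console.log(m.indices[1]);       // [ 4, 6 ]  unnamed group: cd
console.log(m.indices[2]);       // undefined ("Z" did not take part in the match)
console.log(m.indices.groups.Z); // undefined
```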

According to the proposal, the property indices would only be returned if the regex has the flag d. That is, the regex would be created as /ab*(cd(?<Z>ef)?)/d or new RegExp('ab*(cd(?<Z>ef)?)', 'd').

Recently (May/2021) MDN updated the documentation, and the d flag is already there: see here and here. It is also interesting to note that every RegExp instance gets the hasIndices property, indicating whether the d flag was used (true or false). But be sure to check the compatibility table before using it, because not all browsers support it yet.

Therefore, the code below may or may not work in your browser (I tested in Chrome 90 and it worked):

// May/2021 - only works in some browsers (tested in Chrome 90)
var r = /ab*(cd(?<Z>ef)?)/d; // regex with the d flag
console.log('has the flag:', r.hasIndices); // true

var result = 'zabbcdef'.match(r);
console.log('indices:', result.indices);
console.log('group indices:', result.indices.groups);
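
Besides the compatibility table, one way to check support at runtime is to try to build a regex with the flag, since engines that do not know it throw a SyntaxError for the unknown flag. This is only a sketch; supportsMatchIndices is a name I made up, not part of any API:

```javascript
// Hypothetical helper: true when the engine accepts the d flag.
function supportsMatchIndices() {
  try {
    new RegExp('', 'd'); // throws SyntaxError where the flag is unknown
    return true;
  } catch (e) {
    return false;
  }
}

if (supportsMatchIndices()) {
  // The constructor is used here so that this file still parses on old engines.
  console.log(new RegExp('a', 'd').hasIndices); // true
} else {
  console.log('no native d flag; fall back to the polyfill');
}
```

A regex literal such as /a/d would be a syntax error at parse time on engines without the flag, which is why the check goes through the RegExp constructor.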


Just to try to clarify a little more, here is another example:

const execWithIndices = require("regexp-match-indices");
const text = "- abc 123 xy 4567 .";
const result = execWithIndices(/([a-z]+) (?<nums>\d+) ([a-z]+) (?<othernums>\d+)/, text);
console.log(result.indices);

The regex searches for sequences of letters ([a-z]+) and digits (\d+); the digits are in named groups, and the letters are in "normal" (unnamed) groups.

To be more precise, the regex searches for letters, space, digits, space, letters, space and digits. There are four capture groups: the first and third match the letters, and the second and fourth match the digits (and these have the names "nums" and "othernums").

In this case, the value of the property indices is the array:

[
  [ 2, 17 ],
  [ 2, 5 ],
  [ 6, 9 ],
  [ 10, 12 ],
  [ 13, 17 ],
  groups: { nums: [ 6, 9 ], othernums: [ 13, 17 ] }
]

In this case, the array elements are:

  • [2, 17]: the indices corresponding to the whole match found (i.e., corresponds to the entire excerpt "abc 123 xy 4567")
  • [2, 5]: the indexes that correspond to the first capture group (the first occurrence of "one or more letters" - the excerpt "abc")
  • [6, 9]: the indices corresponding to the second capture group (the first occurrence of "one or more digits" - the "123")
  • [10, 12]: the indices corresponding to the third capture group (the second occurrence of "one or more letters" - the "xy")
  • [13, 17]: the indices corresponding to the fourth capture group (the second occurrence of "one or more digits" - the excerpt "4567")
  • the property groups, which is an object containing the indices corresponding to the named groups (the name of each group being a key, and the value is the respective array containing the indices)
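
As a use case, having both the start and the end of each group makes it easy to point at exactly which part of the input matched, for example in error messages or syntax highlighting. A minimal sketch, assuming native support for the d flag, that prints a marker line under the two named groups:

```javascript
const text = '- abc 123 xy 4567 .';
const re = /([a-z]+) (?<nums>\d+) ([a-z]+) (?<othernums>\d+)/d;
const { indices } = re.exec(text);

// Start with a line of spaces and put ^ under every character
// that belongs to a named group.
const marker = Array(text.length).fill(' ');
for (const [start, end] of Object.values(indices.groups)) {
  for (let i = start; i < end; i++) marker[i] = '^';
}

console.log(text);
console.log(marker.join(''));
// - abc 123 xy 4567 .
//       ^^^    ^^^^
```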

Now, if the second occurrence of letters and numbers is optional:

const execWithIndices = require("regexp-match-indices");
const text = "- abc 123.";
const result = execWithIndices(/([a-z]+) (?<nums>\d+)(?: ([a-z]+) (?<othernums>\d+))?/, text);
console.log(result.indices);

The result will be:

[
  [ 2, 9 ],
  [ 2, 5 ],
  [ 6, 9 ],
  undefined,
  undefined,
  groups: { nums: [ 6, 9 ], othernums: undefined }
]

That is, undefined was returned in the positions corresponding to groups that are present in the regex but, because they are in an optional part, did not end up being matched.

And of course, if the regex has no capture group, only the indices of the whole match are returned. That is, in the case below:
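
In practice this means the entries should be checked before use, because something like text.slice(...indices[3]) throws a TypeError when the entry is undefined. A sketch of one way to guard against that, assuming native support for the d flag (sliceGroup is a helper name I made up):

```javascript
const text = '- abc 123.';
const re = /([a-z]+) (?<nums>\d+)(?: ([a-z]+) (?<othernums>\d+))?/d;
const { indices } = re.exec(text);

// Hypothetical helper: the text covered by a [start, end] pair,
// or null when the group did not take part in the match.
const sliceGroup = (pair) => (pair ? text.slice(pair[0], pair[1]) : null);

console.log(sliceGroup(indices.groups.nums));      // 123
console.log(sliceGroup(indices.groups.othernums)); // null
```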

const execWithIndices = require("regexp-match-indices");
const text = "- abc 123 xy 4567 .";
const result = execWithIndices(/[a-z]+ \d+/, text);
console.log(result.indices);

The result will be:

[ [ 2, 9 ], groups: undefined ]

That is, the indices [2, 9] indicate where the match is (in this case letters, space and digits), and as there are no groups, there are no more elements (and as there are no named groups, the groups property is undefined).


Remember that in the npm package the indices property is lazy by default and is populated only when requested (i.e., if you only use result in the examples above, result.indices is not populated; the array of indices is built only when you access result.indices directly). This behavior can be changed to match the specification, which has no lazy behavior. See the difference:

const execWithIndices = require("regexp-match-indices");
const text = "- abc 123 xy 4567 .";
let result = execWithIndices(/[a-z]+ \d+/, text);
console.log(result); // shows "indices: [Getter/Setter]"

// turn off "lazy" mode to match the specification
require("regexp-match-indices/config").mode = "spec-compliant";
result = execWithIndices(/[a-z]+ \d+/, text);
console.log(result); // shows "indices: [ [2, 9], groups: undefined ]"
