How does regular expression with double range work?

Question

Asked 4 years, 9 months ago

Viewed 187 times

6

I am studying regex and have a question about the range operator (-).

I understand that a range like [a-z] means "all lowercase letters of the alphabet" and that a negated range like [^A-R] would be something like "all capital letters that are not between A and R" (I’m not quite sure about this interpretation).

But what does an expression with two ranges, such as [^A-RU-Z], mean?

When the list is introduced by ^, it becomes a negated list. In your example, you are simply narrowing the scope of an A-Z list by filtering out the characters in the ranges A-R and U-Z, thus preserving the uppercase characters S and T. Rudimentary examples on regex101.com: 1 and 2.

– Rfroes87

2020/11/01 at 14:29
Hmm... so is it equivalent to [ST]?

– Lucas

2020/11/01 at 14:37
No. In that case it would match every character (of any type) that is not in the negated list you specified, including S and T.

– Rfroes87

2020/11/01 at 14:40
Got it. Thank you. If you want, post your comment as an answer and I will accept it.

– Lucas

2020/11/01 at 14:41
1

Remember that a range is not restricted to letters: you can use any characters, so [!-;] and [最-] are valid ranges (the latter works if the engine has proper Unicode support, which varies by language/tool). In the end, as the answers have already explained in detail, a range is just a way to "abbreviate" a set of characters so that you don’t have to type them one by one, so having 2 or more ranges in the same class is nothing special :-)

– hkotsubo

2020/11/01 at 17:45

2 answers

7

Character class

When you use character class notation, that is, the notation between brackets ([]), you are saying: match one character that is in this character set.

When you use the -, you are defining a range of characters. It ends up serving as an "abbreviation" for a set of characters that are in sequence. For example:

let regex1 = /[A-F]/;
// is equivalent to
let regex2 = /[ABCDEF]/;

// Check whether every character
// in the list matches the regex

// true
console.log(['B', 'E', 'D'].every((char) => regex1.test(char)));
// true
console.log(['B', 'E', 'D'].every((char) => regex2.test(char)));

That is, both regexes say: match one character that is in the range/character set from A to F (capital letters).

However, note that "character" is singular in that sentence.

Why? Because a character class on its own matches only 1 character.
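
As a small sketch of this point (the sample string 'BED' is borrowed from the next example), a class without a quantifier consumes exactly one character, even when the characters after it would also fit:

```javascript
// A character class with no quantifier matches exactly one character.
const single = /[A-F]/.exec('BED');

console.log(single[0]);        // 'B'
console.log(single.index);     // 0
console.log(single[0].length); // 1
```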

To match more than 1 character, you have to use one of the quantifiers, of which there are several. One example would be:

// matches the whole string
console.log(/[A-F]+/.exec('BED'));

That is: match 1 or more characters in the range A to F.

Another example:

let regex3 = /[A-F]{2,3}/;

// matches 'AB'
console.log(regex3.exec('GHIABJ'));
// matches 'DEF'
console.log(regex3.exec('KLDEFMN'));

That is: match at least 2 and at most 3 characters in the range A to F.

Character class with negation

From there, we can also use negation. That is, instead of "match a character that is in the set [...]", we can say "match a character that is not in the set [...]".

To negate the class, we put the ^ symbol right after the opening bracket ([).

So, using the example above, we can see what happens now:

let regex4 = /[^A-F]{2,3}/;

// matches 'GHI'
console.log(regex4.exec('GHIABJ'));
// matches 'KL'
console.log(regex4.exec('KLDEFMN'));

That is, the regex now says: match 2 to 3 characters that are not in the range A to F.

Therefore, we can invert the match so that it accepts only characters that are not in a certain range.
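
One consequence worth spelling out is that a negated class is not limited to letters. As a hedged sketch (the helper name `matchesNegated` is made up for illustration), [^A-F] accepts any single character outside A to F, including lowercase letters, digits and spaces:

```javascript
// Hypothetical helper: true when the single character `ch` is accepted by [^A-F].
const matchesNegated = (ch) => /^[^A-F]$/.test(ch);

console.log(matchesNegated('B')); // false: inside A-F
console.log(matchesNegated('G')); // true: uppercase, but outside A-F
console.log(matchesNegated('b')); // true: lowercase 'b' is a different character
console.log(matchesNegated('7')); // true: digits are not in A-F
console.log(matchesNegated(' ')); // true: neither is a space
```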

Combining ranges

And finally, we get to the point of the question!

As we saw earlier, the hyphen (-) serves, in a way, to "abbreviate" a sequence of characters. Therefore, if we want to combine one range with another, we just add the new range to the set, that is, inside the brackets ([]), like this:

[A-DW-Z]

This would be the same as:

[ABCDWXYZ]

Which is also valid! Let’s see:

let regex6 = /[ABCDWXYZ]+/;
let regex7 = /[A-DW-Z]+/;

// matches 'ABCD'
console.log(regex6.exec('EFABCDGH'));
console.log(regex7.exec('EFABCDGH'));

// matches 'WXYZ'
console.log(regex6.exec('STWXYZUV'));
console.log(regex7.exec('STWXYZUV'));

Therefore, when you use the regex [^A-RU-Z], you are saying: match one character that is neither in the range A to R nor in the range U to Z.

That is, as I mentioned above, it would be the same as:

[^ABCDEFGHIJKLMNOPQRUVWXYZ]
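
To check this equivalence, and to see why the class is not the same as [ST] (the point raised in the comments), here is a minimal sketch. The helper `expandRange` and the choice of printable ASCII as the test set are assumptions made for illustration, not part of the original answer:

```javascript
// Hypothetical helper: expand a range such as A-R into its explicit characters,
// walking the UTF-16 code units from `from` to `to`.
function expandRange(from, to) {
  let chars = '';
  for (let code = from.charCodeAt(0); code <= to.charCodeAt(0); code++) {
    chars += String.fromCharCode(code);
  }
  return chars;
}

const explicit = expandRange('A', 'R') + expandRange('U', 'Z');
console.log(explicit); // 'ABCDEFGHIJKLMNOPQRUVWXYZ'

const withRanges = /^[^A-RU-Z]$/;
const withList = new RegExp('^[^' + explicit + ']$');
const onlyST = /^[ST]$/;

// Compare the classes on every printable ASCII character (codes 32 to 126).
const ascii = expandRange(' ', '~').split('');

console.log(ascii.every((ch) => withRanges.test(ch) === withList.test(ch))); // true
console.log(withRanges.test('S'), onlyST.test('S')); // true true
console.log(withRanges.test('s'), onlyST.test('s')); // true false
```

The last line is where [^A-RU-Z] and [ST] part ways: the negated class also accepts lowercase letters, digits, punctuation and anything else outside the two ranges.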

Careful!

One of the first things to note, however, is that regex engines can behave differently from language to language. Here I used JavaScript, but in other languages some of these things may work differently.

A second thing to note is that regexes are normally case-sensitive (they distinguish uppercase from lowercase letters). Therefore, as in the first example, all of these regexes only apply to characters in exactly the case you wrote them. So, as stated in the other answer, a string such as:

"eSTe TexTo Será capTurado"

would be matched in full by the regex if it were [^A-RU-Z]+, for example.

For this reason, some languages offer a way to make the regex case-insensitive. In JavaScript, for example, you add the i flag at the end of the regex:

// matches only 'ST'
console.log(/[^A-RU-Z]+/i.exec("eSTe TexTo não Será capTurado"));

Similarly, another thing to notice is that, the way I wrote it, only the first sequence of characters within the range is matched! That is:

let matches1 = Array.from('ACMN STWZ'.match(/[ABCDWXYZ]+/));
let matches2 = Array.from('ACMN STWZ'.match(/[A-DW-Z]+/));

// in both cases, only `AC` is matched

console.log(matches1);
console.log(matches2);

To match every occurrence in the string, in JavaScript, I add the g flag at the end of the regex:

matches1 = Array.from('ACMN STWZ'.match(/[ABCDWXYZ]+/g));
matches2 = Array.from('ACMN STWZ'.match(/[A-DW-Z]+/g));

// matches both `AC` and `WZ`
console.log(matches1);
console.log(matches2);

You can also see, in that question, the difference between using a global regex with the String.prototype.match method (as I did above) and with the String.prototype.matchAll method.
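
As a rough sketch of that difference, using the same sample string as above: with the g flag, match returns only the matched substrings, while matchAll returns an iterator of full match objects, so details such as the index of each match remain available. Note that matchAll throws a TypeError if the regex does not have the g flag.

```javascript
const text = 'ACMN STWZ';
const globalRegex = /[A-DW-Z]+/g;

// match with the g flag: just the matched substrings.
const viaMatch = text.match(globalRegex);
console.log(viaMatch); // [ 'AC', 'WZ' ]

// matchAll: one match object per occurrence, each carrying its index.
const viaMatchAll = Array.from(text.matchAll(globalRegex), (m) => ({
  value: m[0],
  index: m.index,
}));
console.log(viaMatchAll); // [ { value: 'AC', index: 0 }, { value: 'WZ', index: 7 } ]
```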

Conclusion

So we see that there are many things to take into account when writing a regex, including the very language in which you will use it; trying to cover them all here would make this answer much longer!

Some content on regex:

1

Great answer; I believe it is the most suitable one to be accepted for this question.

– Rfroes87

2020/11/01 at 16:29

by Rfroes87 • **465** points · Answer 1 · 2020-11-01T14:55:32+00:00

When the list is introduced by ^, it becomes a negated list. In your example, you are simply narrowing the scope of an A-Z list by filtering out the characters in the ranges A-R and U-Z.

Note in the following examples that the regular expression matches the uppercase characters S and T (which fall outside the listed ranges) as well as every other kind of character (lowercase letters, digits, accented characters, punctuation, ...):

Example 1: AbSUc/1
The matched characters would be b, S, c, / and 1.

Example 2: Úvaxt.2
All characters would be matched, including the Ú (an accented U is considered a distinct character).
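
The two examples can be reproduced with a short JavaScript sketch. It assumes the first sample string was AbSUc/1 (with uppercase S and U), which is the casing consistent with the listed result, and it writes the Ú as an escape so that it is the single precomposed character U+00DA:

```javascript
// Collect every single character accepted by the negated class.
const negated = /[^A-RU-Z]/g;

// Assumed casing for Example 1: 'A' and 'U' are rejected, the rest is kept.
const firstResult = 'AbSUc/1'.match(negated);
console.log(firstResult); // [ 'b', 'S', 'c', '/', '1' ]

// Example 2: 'Ú' (U+00DA) lies outside both ranges, so every character is kept.
const secondResult = '\u00DAvaxt.2'.match(negated).join('');
console.log(secondResult); // 'Úvaxt.2'
```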