Regex to find text between square brackets

1

I’m trying to create a regex to identify the following occurrence:

[Ticket: 20021501280806]

I need an expression that identifies the ticket number, but only within the string [Ticket: ].

So far I have tried the following:

r'[Ticket: (\d+)]'

But it didn’t work.

2 answers

4


Square brackets have a special meaning in regex: they create a character class. For example, [abc] means "the letter a, or the letter b, or the letter c" (any one of them).

So the expression [Ticket: (\d+)] is a character class meaning "the letter T, or the letter i, or the letter c, and so on". The key detail is that the whole expression matches only one character (any one of the options listed inside the brackets).

Moreover, many metacharacters (those that have a special meaning in regex) "lose their powers" inside brackets. In this regex, the parentheses and the plus sign are just the literal characters (, ) and +, so the regex also finds matches in a string such as +().

To make the regex treat [ and ] as literal characters, escape them with \. The regex then becomes \[Ticket: (\d+)\].
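
As a rough illustration of the difference between the two patterns, here is a small sketch. The variable names (unescaped, escaped, text) are only for this example:

```python
import re

# Unescaped: the brackets form a character class that matches ONE character.
unescaped = re.compile(r'[Ticket: (\d+)]')
# Escaped: the brackets are literal, and (\d+) is a real capture group.
escaped = re.compile(r'\[Ticket: (\d+)\]')

text = '[Ticket: 20021501280806]'

# Inside the class, '+', '(' and ')' are literal, so each one matches by itself.
print(unescaped.findall('+()'))         # ['+', '(', ')']
# On the real text, the first match is just the single character 'T'.
print(unescaped.search(text).group())   # T

# The escaped version matches the whole tag and captures the digits.
print(escaped.search(text).group(1))    # 20021501280806
```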


Options for finding the matches

It is not clear whether the string contains multiple occurrences of "[Ticket: (numbers)]" and you want to find them all, or whether this text occurs only once. Either way, let's look at some options.

If there are several occurrences of this text and you want to capture them all, you can use findall, which returns a list of all occurrences:

import re

texto = 'lorem ipsum [Ticket: 20021501280806] blablabla [Ticket: 123456789] etc [Ticket: 987654] xyz.'
r = re.compile(r'\[Ticket: (\d+)\]')
matches = r.findall(texto)

print(matches) # ['20021501280806', '123456789', '987654']

In this case, the part that matches the numbers (\d+) is in parentheses, which forms a capture group. When a regex has capture groups, findall returns only their contents, so the list already contains only the numbers.
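
To make the effect of capture groups on findall more concrete, here is a small sketch comparing the same regex with one group, with no group, and with two groups. The shortened text and the variable names are made up for this example:

```python
import re

texto = 'a [Ticket: 123] b [Ticket: 456]'

# One capture group: findall returns only what the group captured.
with_group = re.findall(r'\[Ticket: (\d+)\]', texto)
print(with_group)      # ['123', '456']

# No capture group: findall returns the whole matched text.
without_group = re.findall(r'\[Ticket: \d+\]', texto)
print(without_group)   # ['[Ticket: 123]', '[Ticket: 456]']

# Two capture groups: findall returns one tuple per match.
two_groups = re.findall(r'\[(Ticket): (\d+)\]', texto)
print(two_groups)      # [('Ticket', '123'), ('Ticket', '456')]
```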

If you want, you don’t have to compile and can use regex directly:

matches = re.findall(r'\[Ticket: (\d+)\]', texto)

According to the documentation, using compile is more efficient when the same regex is used several times in the same program. Which one to use is up to you.


Another option is to use finditer, which returns an iterator over the matches:

import re

texto = 'lorem ipsum [Ticket: 20021501280806] blablabla [Ticket: 123456789] etc [Ticket: 987654] xyz.'
r = re.compile(r'\[Ticket: (\d+)\]')

for match in r.finditer(texto):
    print(match.group(1))

On each iteration of the for loop, finditer yields a Match object containing information about the text that was found. Since the information I'm interested in is in the capture group, I use the group method to get it. Because (\d+) is the first pair of parentheses in the regex, it is the first capture group (group 1), so I call match.group(1) to get the captured text. The output is:

20021501280806
123456789
987654

The difference between the two approaches above is that findall returns a list of all occurrences found, while finditer returns an iterator that yields only one match per iteration. When there are many occurrences, finditer uses much less memory (because it does not load all the matches at once), and it does not search for the remaining occurrences if the loop is interrupted (whereas findall always has to find every occurrence before it can return the list).
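
A minimal sketch of that lazy behaviour, assuming you only need the first two ticket numbers. Using itertools.islice to stop early is just one possible choice for this example (a break inside the for loop would work too):

```python
import re
from itertools import islice

texto = 'lorem ipsum [Ticket: 20021501280806] blablabla [Ticket: 123456789] etc [Ticket: 987654] xyz.'
r = re.compile(r'\[Ticket: (\d+)\]')

# finditer produces matches on demand; islice stops asking after two of them,
# so the third occurrence is never searched for.
first_two = [match.group(1) for match in islice(r.finditer(texto), 2)]
print(first_two)  # ['20021501280806', '123456789']

# findall, in contrast, always scans the whole text and builds the full list.
all_tickets = r.findall(texto)
print(len(all_tickets))  # 3
```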

And just like findall, finditer can also be used without calling compile first:

for match in re.finditer(r'\[Ticket: (\d+)\]', texto):
    print(match.group(1))

If the text only occurs once - or if it occurs several times, but you only want the first occurrence - you can use search:

import re

texto = 'lorem ipsum [Ticket: 20021501280806] blablabla [Ticket: 123456789] etc [Ticket: 987654] xyz.'
r = re.compile(r'\[Ticket: (\d+)\]')

match = r.search(texto)
if match:
    print(match.group(1)) # 20021501280806

In this case it finds the first occurrence of the regex in the text and ignores the others. As with the options above, you can also call search directly, without calling compile first:

match = re.search(r'\[Ticket: (\d+)\]', texto)
if match:
    print(match.group(1))
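
The "if match" check matters because search returns None when nothing is found. As a sketch, the pattern could be wrapped in a small helper. The names TICKET_RE and first_ticket are hypothetical and used only for illustration:

```python
import re
from typing import Optional

# Hypothetical module-level pattern, compiled once and reused.
TICKET_RE = re.compile(r'\[Ticket: (\d+)\]')

def first_ticket(text: str) -> Optional[str]:
    """Return the first ticket number found in text, or None if there is none."""
    match = TICKET_RE.search(text)
    # search returns None when there is no match, so guard before calling group.
    return match.group(1) if match else None

print(first_ticket('see [Ticket: 42] and [Ticket: 43]'))  # 42
print(first_ticket('no ticket here'))                     # None
```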

3

The problem is that your string contains brackets, which are special characters in regex, as described in the documentation.

To solve this, either remove the brackets from your string or put the \ character before each bracket in the pattern, which tells the regex not to treat it as a special character. See the example below:

import re

string = "[Ticket: 20021501280806]"
print(re.findall(r'\[Ticket: (\d+)\]', string))  # ['20021501280806']
