Regex to identify all occurrences of years

Question

Asked 6 years, 3 months ago

Viewed 317 times

5

I wrote this regex to capture the years (between 2010 and 2029) that occur in a sequence of digits.

import re

text = '0412020982012'
rg = r'20[1-2][0-9]'
years = re.compile("(%s)" % (rg)).findall(text)

It works perfectly.

But if the string is 0420202198, it doesn't work: it picks up only 2020 and not 2021, because the two years share digits.

In my case, I always want the rightmost year found. How can I solve this while still using regex?

4 answers

6

Use:

.*(20[1-2][0-9])

Group 1 will always be the last 20XX sequence (where XX satisfies the regex's criteria):

3420201298      -> group 1: 2012
    ¯¯¯¯
34352020982012  -> group 1: 2012
          ¯¯¯¯

The .* is greedy: it consumes as many characters as possible to the left of the group, so the group captures the last matching sequence (2012) and not the first (2020).
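A minimal Python sketch of this approach, assuming the pattern is applied with re.search. The helper name last_year is purely illustrative and not part of the original answer:

```python
import re

# Greedy .* consumes as much as possible, so group 1 is the rightmost 20XX.
LAST_YEAR = re.compile(r'.*(20[1-2][0-9])')

def last_year(text):
    """Return the rightmost year between 2010 and 2029, or None if absent."""
    m = LAST_YEAR.search(text)
    return m.group(1) if m else None

print(last_year('0420202198'))      # 2021
print(last_year('34352020982012'))  # 2012
```

Because only one match is produced, this gives the last year only; it does not list every year in the string.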

Okay, @Sam. I tested it here and it worked. Thank you very much.

– Antonio Braz Finizola

2019/05/02 at 00:40
@Antoniobrazfinizola findall returns a list of all matches, so I understood that you wanted all of them, not just the last. The regex above only gets the last occurrence. If there are three years, for example text = '041202021982012', which has 2020, 2021 and 2012, do you want only the last one or a list with all 3? Using .* you get only the last one; using a for loop as I suggested, you get a list with all of them. I'm sorry if I misunderstood what you needed, I just wanted to clear up the doubt...

– hkotsubo

2019/05/02 at 00:45
Hi, @hkotsubo. My intention really was to collect the years and then take the last (rightmost) one from that list. I didn't know a regex could be written to get the last one directly, like the one above.

– Antonio Braz Finizola

2019/05/02 at 00:52
@Antoniobrazfinizola It's just that, from the way the question was asked, I understood that you needed to capture all of them. But all right, good that it's solved :-)

– hkotsubo

2019/05/02 at 00:57

by hkotsubo • **55,826** points · Answer 1 · 2019-05-02T00:18:39+00:00

The problem is that findall traverses the string from left to right, and each time it finds a match, the next search starts right after the matched stretch.

In this case, in the string 0420202198, after finding 2020, the search resumes from the third 2 (which begins the stretch 2198), and therefore 2021 is not found.
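This behaviour can be observed with finditer, which scans the same way as findall but also exposes the position of each match. A small sketch, with variable names chosen for illustration:

```python
import re

text = '0420202198'
pattern = re.compile(r'20[1-2][0-9]')

# Collect each match together with its (start, end) span.
spans = [(m.group(), m.span()) for m in pattern.finditer(text)]
print(spans)  # [('2020', (2, 6))]

# The scan resumes at index 6, where only '2198' remains.
remainder = text[6:]
print(remainder)  # 2198
```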

To do what you need, one option is to use the match method, passing the position where the search should begin:

import re

text = '0420202198'
r = re.compile('20[1-2][0-9]')
years = []
# try a match starting at every position of the string
for pos in range(0, len(text)):
    m = r.match(text, pos)
    if m:
        years.append(m.group())

print(years)

In this case you do not need the parentheses in the regex (they create capture groups, which are what findall returns), because the group method, when called without arguments, already returns the whole stretch that was matched.

Also, I didn't need a raw string (the r before the quotation marks). This syntax is useful when the regex contains characters such as \ (so that you don't have to write it as \\), but in this case it is not necessary.

I also put the results in a list, so that the return value is the same as that of findall (which returns a list of the results).

The output is:

['2020', '2021']

Since your regex always needs exactly 4 characters to match, I could optimize the loop a little and use range(0, len(text) - 3) in the for (this avoids iterating over the last 3 positions, because from there on there are not enough characters left to satisfy the regex).

You can also use list comprehension syntax, which is much more Pythonic (and usually more succinct, although in this particular case I'm not sure it is):

import re

text = '0420202198'
r = re.compile('20[1-2][0-9]')    
years = [m.group() for pos in range(0, len(text) - 3) for m in [r.match(text, pos)] if m]
print(years)

Or else:

years = [m.group() for m in (r.match(text, pos) for pos in range(0, len(text) - 3)) if m]

Both options above were taken from here.
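As an alternative that is not part of the answer above, a lookahead can make findall report overlapping matches: a lookahead consumes no characters, so the scan advances one position at a time while the capture group inside it still records the year. A sketch, with the variable name overlapping chosen for illustration:

```python
import re

text = '041202021982012'

# (?=...) is zero-width; findall returns the contents of group 1.
overlapping = re.findall(r'(?=(20[1-2][0-9]))', text)
print(overlapping)      # ['2020', '2021', '2012']
print(overlapping[-1])  # 2012
```

This keeps the list-returning behaviour of findall, and the rightmost year is simply the last element.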

by Robson Silva • **897** points · Answer 2 · 2019-05-01T20:21:47+00:00

1

Try changing the variable rg to obtain more than one occurrence of a group of numbers

rg = r'20[1-2]\d+'

This way the regex searches for one or more digits. The \d part is equivalent to "[0-9]".

That doesn’t work because it’ll catch all digits that exist after the first 20 in a single match, see: https://ideone.com/QGS8ZW

– hkotsubo

2019/05/02 at 00:39
It’s true, it really won’t help. Unfortunately I don’t know how to help.

– Robson Silva

2019/05/02 at 00:52
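The behaviour pointed out in the comment can be reproduced with a short check (a sketch; the variable name result is illustrative):

```python
import re

text = '0420202198'

# \d+ is greedy: after '202' it swallows every remaining digit.
result = re.findall(r'20[1-2]\d+', text)
print(result)  # ['20202198']
```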

by Rodrigo Zem • **895** points · Answer 3 · 2019-05-01T20:25:03+00:00

1

Here is a pattern to catch only the year 2012, which is the rightmost year, as you said:

(?:.*?\K20[1-2]{1}[2-9]{1}){2}

Click here to see the Example
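One caveat when using this from Python: \K is a PCRE feature, and as far as I know the built-in re module does not accept it, so this pattern would need a different engine. A quick check, with the flag name supported chosen for illustration:

```python
import re

pattern = r'(?:.*?\K20[1-2]{1}[2-9]{1}){2}'

try:
    re.compile(pattern)
    supported = True
except re.error as exc:
    # The standard-library engine treats \K as an invalid escape.
    supported = False
    print('re rejects the pattern:', exc)
```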

1

Thanks for the answer, @Rodrigo. But I would like an approach that addresses the problem I described, which is handling years that appear like this: 0420202198. There are two years there, but the way I did it I can only get 2020, not 2021.

– Antonio Braz Finizola

2019/05/01 at 22:14
New pattern: (?:(?<=20)20[1-2]{1}[0-9]{1})|(?:(20[1-2]{1}[2-9]{1}))

– Rodrigo Zem

2019/05/02 at 03:04
I believe this is now what you want: https://regex101.com/r/1metXT/1

– Rodrigo Zem

2019/05/02 at 03:05