Regexp extract value

3

I have the following strings:

"The.Office.US.S{SE}E{EP}.the.Dundies.720p.srt"

"The Office [{SE}.{EP}] The Fight.srt"

Each string is a "template" for a file name; the actual files will have the following form:

"The.Office.US.S01E06.the.Dundies.720p.srt"

"The Office [01.06] The Fight.srt"

I need to extract the 01 and 06 values from these strings using Python, but I have not been able to build a regex that works for my case.

#encoding: utf-8
import re
template = "The.Office.US.S{SE}E{EP}.The.Dundies.720p.srt"
arq = "The.Office.US.S01E06.The.Dundies.720p.srt"

# This line is where I am stuck
pat = re.compile('\{.*?\}')

season, episode = re.findall(pat, arq)
print("Season: ", season)
print("Episode: ", episode)

2 answers

4

Edited, final version

This version handles both templates with a single regex and returns a tuple of digit strings. Because the regex covers both templates, the tuple always has 4 items, 2 of them None, since only one of the two alternatives takes part in any given match. You can also read the groups individually: for the bracket template (s1 below), use the groups 'Colchete' (bracket) and 'Ponto' (dot); for the S/E template (s2 below), use the groups 'S' and 'E'. Note that the code below uses a single regular expression that is really two expressions separated by a pipe (|), so it is also possible to build a more granular version with one regex per template, as shown right after.
import re

s1 = "The Office [01.06] The Fight.srt"
s2 = 'The.Office.US.S01E06.The.Dundies.720p.srt'
# Two alternatives joined by |: "01.06" (bracket form) or "01E06" (S/E form)
padrao = '(?P<Colchete>\\d{2})\\.(?P<Ponto>\\d{2})|(?P<S>\\d{2})E(?P<E>\\d{2})'
re1 = re.compile(padrao)

print('## Result for s1 ##')
print('Groups: ', re1.search(s1).groups())
print('Colchete: ', re1.search(s1).group('Colchete'))
print('Ponto: ', re1.search(s1).group('Ponto'), '\n')

## Result for s1 ##
Groups:  ('01', '06', None, None)
Colchete:  01
Ponto:  06

print('## Result for s2 ##')
print('Groups: ', re1.search(s2).groups())
print('S: ', re1.search(s2).group('S'))
print('E: ', re1.search(s2).group('E'))

## Result for s2 ##
Groups:  (None, None, '01', '06')
S:  01
E:  06
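
If the caller only wants the season and episode and does not care which alternative matched, the 4-item tuple can be collapsed by dropping the None entries. The following is a minimal sketch under that assumption; the helper name season_episode is hypothetical and not part of the original answer.

```python
import re

# Same combined pattern as above: "01.06" (bracket form) or "01E06" (S/E form).
PADRAO = re.compile(
    r'(?P<Colchete>\d{2})\.(?P<Ponto>\d{2})|(?P<S>\d{2})E(?P<E>\d{2})'
)

def season_episode(filename):
    """Return (season, episode) as strings, or None if nothing matches."""
    m = PADRAO.search(filename)
    if m is None:
        return None
    # Only one alternative takes part in a match; the other two groups are None.
    found = [g for g in m.groups() if g is not None]
    return found[0], found[1]

print(season_episode("The Office [01.06] The Fight.srt"))
print(season_episode("The.Office.US.S01E06.The.Dundies.720p.srt"))
```

Both calls should give the same pair of strings, whichever branch of the alternation matched.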

You can also make a more granular version by splitting the regex in two and handling each template separately, something like this:

# One pattern per template
padrao1 = '(?P<Colchete>\\d{2})\\.(?P<Ponto>\\d{2})'
padrao2 = '(?P<S>\\d{2})E(?P<E>\\d{2})'

re_p1 = re.compile(padrao1)
re_p2 = re.compile(padrao2)

print('## Results for the granular version ##')

print('## For s1 ##')
print('Groups: ', re_p1.search(s1).groups())
print('Colchete: ', re_p1.search(s1).group('Colchete'))
print('Ponto: ', re_p1.search(s1).group('Ponto'), '\n')

## Results for the granular version ##
## For s1 ##
Groups:  ('01', '06')
Colchete:  01
Ponto:  06

print('## For s2 ##')
print('Groups: ', re_p2.search(s2).groups())
print('S: ', re_p2.search(s2).group('S'))
print('E: ', re_p2.search(s2).group('E'))

## For s2 ##
Groups:  ('01', '06')
S:  01
E:  06
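
With one regex per template, a natural next step is to try the patterns in order and report which one matched. This is only a sketch: the labels 'bracket' and 'SE' and the function name identify are made up for illustration.

```python
import re

# The two per-template patterns, each with a hypothetical label.
PATTERNS = [
    ('bracket', re.compile(r'(?P<Colchete>\d{2})\.(?P<Ponto>\d{2})')),
    ('SE', re.compile(r'(?P<S>\d{2})E(?P<E>\d{2})')),
]

def identify(filename):
    """Return (label, season, episode) for the first pattern that matches, else None."""
    for label, pattern in PATTERNS:
        m = pattern.search(filename)
        if m:
            season, episode = m.groups()
            return label, season, episode
    return None

print(identify("The Office [01.06] The Fight.srt"))
print(identify('The.Office.US.S01E06.The.Dundies.720p.srt'))
```

The order of PATTERNS matters here: the first pattern that finds anything wins, so a file name that happens to contain two digits, a dot and two more digits elsewhere would be classified as the bracket form.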


  • This solution would only work for the second string; I need something that works for both. I know that with a bit of a hack it could be used this way, but I need to avoid that as much as possible.

  • Yeah, after I saw the other template I edited the answer and completely changed the version. It now works for any string: you can pull out all the numbers and then just adapt it.

  • \d+ can cause trouble if the series name contains numbers. This solution would not handle The 4400.

  • 1

    In that case there has to be a pattern: you can put the \d+ between two characters that the template specifies. As I said, just adapt it. Now... if the name falls outside the template pattern, it is practically impossible.

  • @Sidon r'^.*( \[([0-9]+)\.([0-9]+)\] |\.S([0-9]+)E([0-9]+)\.).*\.srt$', taking as the answer SE=\2\4 (the season is the second or fourth backreference) and EP=\3\5 (the episode is the third or fifth backreference). Since backreferences 2 and 3 always occur together, 4 and 5 also occur together, and 2,3 are mutually exclusive with 4,5, this expression is more tightly scoped and would give the same positive results ;-) OK, this is a sed-style solution, not a Pythonic one, but it is not so absurdly far from being adaptable to Python.

  • 1

    I found a very Pythonic solution that I "stole" from Django; I will edit the answer.

  • @Sidon, can you tag me when you edit?

  • The first solution fits my problem better. I will have to make some adjustments, but it will probably be something along these lines.

  • I would just like to point out that the templates are not fixed: the user supplies them, and the {SE} and {EP} tags define where the values are.

  • 1

    Note that if the name contains 2 digits, the letter E in the middle and then 2 more digits, the regex will always find it, no matter where it appears. The second solution is more sophisticated; if the name follows the pattern, it will probably never fail. @Jeffersonquesado.

  • 1

    By the way: adapting the second solution to the first template is quite simple.

  • 1

    @Jeffersonquesado, I edited the answer again, with a version that handles both templates.
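
The anchored expression suggested in the comments can be tried in Python roughly as follows. This is a sketch rather than a tested recommendation; the function name season_episode is hypothetical. Groups 2 and 3 belong to the bracket form and groups 4 and 5 to the S/E form, so the season is whichever of groups 2 and 4 is set.

```python
import re

# Expression from the comment above, unchanged.
PATTERN = re.compile(
    r'^.*( \[([0-9]+)\.([0-9]+)\] |\.S([0-9]+)E([0-9]+)\.).*\.srt$'
)

def season_episode(filename):
    """Return (season, episode) as strings, or None if the name does not fit."""
    m = PATTERN.match(filename)
    if m is None:
        return None
    # Groups 2,3 and groups 4,5 are mutually exclusive; pick the pair that is set.
    season = m.group(2) or m.group(4)
    episode = m.group(3) or m.group(5)
    return season, episode

print(season_episode("The Office [01.06] The Fight.srt"))
print(season_episode("The.4400.S02E03.Title.srt"))
```

Because the delimiters around the digits are part of the pattern, digits in the series name (as in The 4400) are not mistaken for the season or episode. Note that the expression also requires some text between the episode number and the .srt extension.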


2


You have to turn your template into a regex. Since it contains several characters that are special in a regular expression, you first need to escape them. Then just replace {SE} and {EP} with a group that captures [0-9]+ and you are done. The code below does this:

import re

def template2regex(template):
    # Escape regex metacharacters in the template (this also escapes { and })
    template = re.escape(template)

    # Swap the escaped tags for named groups that capture the digits
    regex = template.replace(r'\{SE\}', '(?P<season>[0-9]+)')
    regex = regex.replace(r'\{EP\}', '(?P<episode>[0-9]+)')

    return re.compile(regex)

>>> template = "The.Office.US.S{SE}E{EP}.The.Dundies.720p.srt"
>>> regex = template2regex(template)
>>> regex.search('The.Office.US.S01E06.The.Dundies.720p.srt').groups()
('01', '06')
>>> template = 'The Office [{SE}.{EP}] The Fight.srt'
>>> regex = template2regex(template)
>>> regex.search('The Office [01.06] The Fight.srt').groups()
('01', '06')
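
A possible variation on the same idea, sketched here under a few assumptions: the tag table TAGS and the function name parse_with_template are hypothetical, re.fullmatch is used so the whole file name has to fit the template, and the values are converted to int. Since the rest of the template is matched literally, digits in the series name (such as The 4400, raised in the comments on the other answer) should not interfere.

```python
import re

# Hypothetical tag table: tag text -> named group that captures the digits.
TAGS = {
    '{SE}': '(?P<season>[0-9]+)',
    '{EP}': '(?P<episode>[0-9]+)',
}

def parse_with_template(template, filename):
    """Return (season, episode) as ints, or None if the name does not fit."""
    regex = re.escape(template)
    for tag, group in TAGS.items():
        # Replace the escaped form of the tag, whatever re.escape produced for it.
        regex = regex.replace(re.escape(tag), group)
    m = re.fullmatch(regex, filename)
    if m is None:
        return None
    return int(m.group('season')), int(m.group('episode'))

print(parse_with_template('The.4400.S{SE}E{EP}.srt', 'The.4400.S02E11.srt'))
print(parse_with_template('The Office [{SE}.{EP}] The Fight.srt',
                          'The Office [01.06] The Fight.srt'))
```

Escaping each tag with re.escape, instead of hard-coding the escaped text, keeps the replacement in step with whatever escaping the installed Python version applies.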
