Regexp extract value

Question

Asked 8 years, 1 month ago

Viewed 225 times

3

I have the following strings:

"The.Office.US.S{SE}E{EP}.the.Dundies.720p.srt"

"The Office [{SE}.{EP}] The Fight.srt"

Each string is a "template" for a file name. The actual files will look like this:

"The.Office.US.S01E06.the.Dundies.720p.srt"

"The Office [01.06] The Fight.srt"

I need to extract the values 01 and 06 from these strings using Python, but I haven't been able to build a regex that works for my case.

#encoding: utf-8
import re
template = "The.Office.US.S{SE}E{EP}.The.Dundies.720p.srt"
arq = "The.Office.US.S01E06.The.Dundies.720p.srt"

# This is the line I'm having trouble with
pat = re.compile(r'\{.*?\}')

season, episode = re.findall(pat, arq)
print("Season: ", season)
print("Episode: ", episode)

2 answers

2

You have to turn your template into a regex. Since it contains several characters that are special in a regular expression, you first need to escape them. Then replace {SE} and {EP} (which read \{SE\} and \{EP\} after escaping) with a group that captures [0-9]+, and you're done. The code below does this:

import re

def template2regex(template):
    # Escape characters that are special in a regex (., [, ], {, } ...)
    template = re.escape(template)

    # After escaping, "{SE}" reads "\{SE\}"; swap in named digit groups
    regex = template.replace(r'\{SE\}', '(?P<season>[0-9]+)')
    regex = regex.replace(r'\{EP\}', '(?P<episode>[0-9]+)')

    return re.compile(regex)

template = "The.Office.US.S{SE}E{EP}.The.Dundies.720p.srt"
regex = template2regex(template)
print(regex.search('The.Office.US.S01E06.The.Dundies.720p.srt').groups())
# ('01', '06')

template = 'The Office [{SE}.{EP}] The Fight.srt'
regex = template2regex(template)
print(regex.search('The Office [01.06] The Fight.srt').groups())
# ('01', '06')
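
If a file name might not fit the template at all, `search(...)` returns None and calling `.groups()` on it raises an error. The following is a minimal sketch of a more defensive wrapper around the same idea, not part of the original answer: `template_to_regex` and `extract_season_episode` are hypothetical names, and using `fullmatch` (so that the whole file name must fit the template) is an assumption about what is wanted.

```python
import re

def template_to_regex(template):
    # Escape regex metacharacters in the literal parts of the template.
    escaped = re.escape(template)
    # re.escape turns "{SE}" into "\{SE\}", so replace the escaped form.
    escaped = escaped.replace(r'\{SE\}', r'(?P<season>[0-9]+)')
    escaped = escaped.replace(r'\{EP\}', r'(?P<episode>[0-9]+)')
    return re.compile(escaped)

def extract_season_episode(template, filename):
    """Return (season, episode) as strings, or None if the name does not fit."""
    # fullmatch (an assumption here) requires the entire file name to match.
    match = template_to_regex(template).fullmatch(filename)
    if match is None:
        return None
    return match.group('season'), match.group('episode')

print(extract_season_episode("The.Office.US.S{SE}E{EP}.The.Dundies.720p.srt",
                             "The.Office.US.S01E06.The.Dundies.720p.srt"))
print(extract_season_episode("The Office [{SE}.{EP}] The Fight.srt",
                             "Some other file.srt"))
```

The first call yields the season and episode strings; the second yields None instead of raising, which is easier to handle when looping over a directory of mixed files.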

by Sidon • **6,563** points · Answer 1 · 2017-06-12T00:25:40+00:00

Edited, final version
This handles both templates with a single regex. The result is a tuple of digit strings. Because the regex covers both templates, the tuple always has 4 items, and 2 of them are always None, since only one side of the alternation can take part in a match. You can also read the groups individually. For the bracketed template (s1 below), use the groups 'Colchete' and 'Ponto'; for the other template (s2 below), use the groups 'S' and 'E'. Note that the code below uses a single regular expression that is really two expressions separated by a pipe (|), so a more granular version, with one regex per template, is also possible and is shown right after.

import re

s1 = "The Office [01.06] The Fight.srt"
s2 = 'The.Office.US.S01E06.The.Dundies.720p.srt'
# Two alternatives joined by |: "NN.NN" (bracket form) or "NNENN" (SxxEyy form)
padrao = r'(?P<Colchete>\d{2})\.(?P<Ponto>\d{2})|(?P<S>\d{2})E(?P<E>\d{2})'
re1 = re.compile(padrao)

print('## Result for s1 ##')
print('Groups: ', re1.search(s1).groups())
print('Colchete: ', re1.search(s1).group('Colchete'))
print('Ponto: ', re1.search(s1).group('Ponto'), '\n')

## Result for s1 ##
Groups:  ('01', '06', None, None)
Colchete:  01
Ponto:  06

print('## Result for s2 ##')
print('Groups: ', re1.search(s2).groups())
print('S: ', re1.search(s2).group('S'))
print('E: ', re1.search(s2).group('E'))

## Result for s2 ##
Groups:  (None, None, '01', '06')
S:  01
E:  06
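
Since two of the four groups are always None, it can be convenient to collapse the result into a plain (season, episode) pair regardless of which alternative matched. A minimal sketch, assuming the same combined pattern as above; `season_episode` is a hypothetical helper name, not something from the original answer:

```python
import re

# Same combined pattern as above: "NN.NN" form first, then the "NNENN" form.
padrao = r'(?P<Colchete>\d{2})\.(?P<Ponto>\d{2})|(?P<S>\d{2})E(?P<E>\d{2})'
re1 = re.compile(padrao)

def season_episode(name):
    """Return (season, episode) from whichever alternative matched, else None."""
    m = re1.search(name)
    if m is None:
        return None
    # Exactly one alternative takes part in a match, so two groups remain
    # after dropping the None entries.
    season, episode = (g for g in m.groups() if g is not None)
    return season, episode

print(season_episode("The Office [01.06] The Fight.srt"))
print(season_episode("The.Office.US.S01E06.The.Dundies.720p.srt"))
```

Keep in mind that the pattern is not anchored to the brackets or to the leading "S", so a file name with another "NN.NN" or "NNENN" run earlier in it would match there first.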

You can also make a more granular version by splitting the regex in two and handling each template separately, something like this:

# One pattern per template (s1 and s2 are the strings defined earlier)
padrao1 = r'(?P<Colchete>\d{2})\.(?P<Ponto>\d{2})'
padrao2 = r'(?P<S>\d{2})E(?P<E>\d{2})'

re_p1 = re.compile(padrao1)
re_p2 = re.compile(padrao2)

print('## Results for the granular version ##')

print('## For s1 ##')
print('Groups: ', re_p1.search(s1).groups())
print('Colchete: ', re_p1.search(s1).group('Colchete'))
print('Ponto: ', re_p1.search(s1).group('Ponto'), '\n')

## Results for the granular version ##
## For s1 ##
Groups:  ('01', '06')
Colchete:  01
Ponto:  06

print('## For s2 ##')
print('Groups: ', re_p2.search(s2).groups())
print('S: ', re_p2.search(s2).group('S'))
print('E: ', re_p2.search(s2).group('E'))

## For s2 ##
Groups:  ('01', '06')
S:  01
E:  06
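
With one regex per template, you usually do not know in advance which template a given file follows. One way to handle that, sketched here under the assumption that trying the patterns in a fixed order is acceptable, is to loop over them and stop at the first hit. The list `patterns` and the function `first_match` are hypothetical names introduced for illustration.

```python
import re

# The two per-template patterns from the granular version above.
padrao1 = r'(?P<Colchete>\d{2})\.(?P<Ponto>\d{2})'
padrao2 = r'(?P<S>\d{2})E(?P<E>\d{2})'
patterns = [re.compile(padrao1), re.compile(padrao2)]

def first_match(name):
    """Try each pattern in order and return the groups of the first hit."""
    for pattern in patterns:
        m = pattern.search(name)
        if m is not None:
            return m.groups()
    return None  # no pattern matched

print(first_match("The Office [01.06] The Fight.srt"))
print(first_match("The.Office.US.S01E06.The.Dundies.720p.srt"))
print(first_match("notes.txt"))
```

Unlike the combined regex, each pattern here yields a 2-item tuple with no None placeholders, and supporting a new naming scheme only means appending another compiled pattern to the list.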
