How to create a regex in Python to get a specific text?

Question

Asked 6 years, 10 months ago

For example, I want to take the string below:

TEXT ABOVE

EXMO. SR. DR. JUÍZ DE DIREITO DO JUIZADO ESPECIAL CÍVEL DA COMARCA DA CAPITAL

TEXT BELOW

The word EXMO. may be replaced by EXCELENTISSIMO, and the word JUIZADO by VARA.

I tried to use the regex below, but I didn’t get much:

re.findall(r'((?:exmo|excelentissimo))', text, re.IGNORECASE)

Does anyone have any idea how I can match this entire piece of the string?

1 answer

by hkotsubo • **55,826** points · Answer 1 · 2019-02-07T22:58:00+00:00

First of all, you don’t need two sets of parentheses in a row.

When there are parentheses, they form a capture group. That is, whatever is inside them will be available as a group if the regex finds a match. This is the standard behavior of parentheses.

But when you don’t want to capture anything and only need to group sub-expressions, you use the syntax (?:, which defines a non-capturing group. Basically, this says that the pair of parentheses should not create a group when the regex finds a match.

So writing ((?:...)) is rather odd. It is a non-capturing group inside a capturing group, which means you said you want to capture something, and that something is "something I don’t want to capture". This is contradictory and unnecessary, so the first step is to eliminate it (and, depending on the case, keep only one of the two, if needed).

Anyway, what’s inside the parentheses is exmo|excelentissimo, which means "the string exmo or the string excelentissimo". This regex takes only this snippet, ignoring the rest of the string.
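As a small illustration of the difference between the two kinds of groups, here is a sketch using a shortened sample string (the variable names and the sample are my own, not from the question):

```python
import re

# Shortened sample, assumed only for this illustration.
sample = "EXMO. SR. DR. JUÍZ"

# Capturing group: the alternation is stored as group 1.
capturing = re.search(r'(exmo|excelentissimo)', sample, re.IGNORECASE)
# Non-capturing group: same match, but no numbered group is created.
non_capturing = re.search(r'(?:exmo|excelentissimo)', sample, re.IGNORECASE)

print(capturing.group(0), capturing.groups())          # EXMO ('EXMO',)
print(non_capturing.group(0), non_capturing.groups())  # EXMO ()

# The pattern from the question matches only the first word, not the whole line.
question_result = re.findall(r'((?:exmo|excelentissimo))', sample, re.IGNORECASE)
print(question_result)  # ['EXMO']
```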

To get the whole line, just use a regex that picks up all the text:

import re

text = """
TEXTO ACIMA

EXMO. SR. DR. JUÍZ DE DIREITO DO JUIZADO ESPECIAL CÍVEL DA COMARCA DA CAPITAL

TEXTO ABAIXO

EXCELENTISSIMO. SR. DR. JUÍZ DE DIREITO DA VARA ESPECIAL CÍVEL DA COMARCA DA CAPITAL

MAIS TEXTO"""

results = re.findall(r'^(?:EXMO|EXCELENTISSIMO)\. SR\. DR\. JUÍZ DE DIREITO (?:DO JUIZADO|DA VARA) ESPECIAL CÍVEL DA COMARCA DA CAPITAL$',
                     text, re.IGNORECASE | re.MULTILINE)

for result in results:
    print(result)

Notice that I used (?:EXMO|EXCELENTISSIMO) and (?:DO JUIZADO|DA VARA) for the parts that can be one or the other. I used non-capturing groups (note the (?: syntax) because, according to the documentation of findall, when the regex contains capture groups, only the groups are returned. Since you want the entire string, I used non-capturing groups.
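To see this findall behavior in isolation, here is a minimal sketch on a shortened line (the shortened text and variable names are assumptions for the example only):

```python
import re

# Shortened line, assumed only for this illustration.
line = "EXMO. SR. DR. JUÍZ DE DIREITO DA VARA ESPECIAL"

# With capturing groups, findall returns only the groups
# (as a tuple per match when there is more than one group).
with_groups = re.findall(
    r'^(EXMO|EXCELENTISSIMO)\. SR\. DR\. JUÍZ DE DIREITO (DO JUIZADO|DA VARA) ESPECIAL$',
    line)

# With non-capturing groups, findall returns the whole match.
without_groups = re.findall(
    r'^(?:EXMO|EXCELENTISSIMO)\. SR\. DR\. JUÍZ DE DIREITO (?:DO JUIZADO|DA VARA) ESPECIAL$',
    line)

print(with_groups)     # [('EXMO', 'DA VARA')]
print(without_groups)  # ['EXMO. SR. DR. JUÍZ DE DIREITO DA VARA ESPECIAL']
```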

The rest of the text, since it does not change, can be written exactly as you want it to appear ("SR. DR. JUÍZ etc"). The only exception is the dot: in a regex it has a special meaning (by default, "any character except a line break"), so it must be escaped and written as \..
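A quick sketch of why the escape matters, using a made-up string in which "EXMO" is followed by different characters:

```python
import re

# Made-up test string, assumed only for this illustration.
words = 'EXMO. EXMOS EXMO'

# Unescaped dot: matches any character after "EXMO".
unescaped = re.findall(r'EXMO.', words)
# Escaped dot: matches only a literal "." after "EXMO".
escaped = re.findall(r'EXMO\.', words)

print(unescaped)  # ['EXMO.', 'EXMOS']
print(escaped)    # ['EXMO.']
```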

Another detail is the anchors ^ and $, which normally mean the beginning and end of the string. Thanks to the MULTILINE flag, they also match at the beginning and end of each line. This ensures that the line contains only what is specified in the regex.
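The effect of the flag can be checked with a short three-line string (the string below is an assumption for the example, not the text from the question):

```python
import re

# Three-line string, assumed only for this illustration.
sample = "first line\nEXMO. SR.\nlast line"

# Without MULTILINE, ^ and $ anchor only to the start and end of the whole string.
without_flag = re.findall(r'^EXMO\. SR\.$', sample)
# With MULTILINE, they also anchor to the start and end of each line.
with_flag = re.findall(r'^EXMO\. SR\.$', sample, re.MULTILINE)

print(without_flag)  # []
print(with_flag)     # ['EXMO. SR.']
```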

The code above prints:

EXMO. SR. DR. JUÍZ DE DIREITO DO JUIZADO ESPECIAL CÍVEL DA COMARCA DA CAPITAL
EXCELENTISSIMO. SR. DR. JUÍZ DE DIREITO DA VARA ESPECIAL CÍVEL DA COMARCA DA CAPITAL

Since you used the IGNORECASE flag, the regex will also match passages with lowercase letters ("Exmo. sr. dr. etc...") and even mixed case ("Exmo. Sr. Dr ..."). If you want to match only uppercase, remove this flag (but don’t forget to keep MULTILINE):

results = re.findall(r'^(?:EXMO|EXCELENTISSIMO)\. SR\. DR\. JUÍZ DE DIREITO (?:DO JUIZADO|DA VARA) ESPECIAL CÍVEL DA COMARCA DA CAPITAL$',
                     text, re.MULTILINE)
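To compare the two variants side by side, here is a sketch with a shortened pattern and a few hypothetical candidate strings (neither is from the question):

```python
import re

# Shortened pattern and candidate strings, assumed only for this illustration.
pattern = r'^EXMO\. SR\. DR\. JUÍZ$'
candidates = ["EXMO. SR. DR. JUÍZ", "Exmo. Sr. Dr. Juíz", "exmo. sr. dr. juíz"]

# With IGNORECASE, every casing matches (accented letters included, for str patterns).
case_insensitive = [c for c in candidates if re.search(pattern, c, re.IGNORECASE)]
# Without the flag, only the exact uppercase form matches.
case_sensitive = [c for c in candidates if re.search(pattern, c)]

print(case_insensitive)  # ['EXMO. SR. DR. JUÍZ', 'Exmo. Sr. Dr. Juíz', 'exmo. sr. dr. juíz']
print(case_sensitive)    # ['EXMO. SR. DR. JUÍZ']
```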

When I was editing your question, I saw that the text was occupying 3 lines:

EXMO. SR. DR. JUÍZ DE DIREITO
DO JUIZADO ESPECIAL
CÍVEL DA COMARCA DA CAPITAL

I don’t know if that is how it really is or if it should all be on one line. In any case, we can replace the space after "DIREITO" and "ESPECIAL" with \s, which matches both a space and a line break:

results = re.findall(r'^(?:EXMO|EXCELENTISSIMO)\. SR\. DR\. JUÍZ DE DIREITO\s(?:DO JUIZADO|DA VARA) ESPECIAL\sCÍVEL DA COMARCA DA CAPITAL$',
                     text, re.MULTILINE)

If you prefer, you can use [ \n] (a space or \n - notice that there is a space after the [), since \s also matches other characters, such as TAB and others mentioned in the documentation.
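The difference between \s and [ \n] shows up when the separator is something like a TAB. A minimal sketch, with sample strings made up for the example:

```python
import re

# Sample strings, assumed only for this illustration.
three_lines = "EXMO. SR. DR. JUÍZ DE DIREITO\nDO JUIZADO ESPECIAL\nCÍVEL DA COMARCA DA CAPITAL"
with_tab = "JUÍZ DE DIREITO\tDO JUIZADO"

# Both forms accept a line break between the words.
newline_with_s = bool(re.search(r'DIREITO\sDO JUIZADO', three_lines))
newline_with_class = bool(re.search(r'DIREITO[ \n]DO JUIZADO', three_lines))

# Only \s accepts a TAB; [ \n] allows just a space or a line break.
tab_with_s = bool(re.search(r'DIREITO\sDO JUIZADO', with_tab))
tab_with_class = bool(re.search(r'DIREITO[ \n]DO JUIZADO', with_tab))

print(newline_with_s, newline_with_class)  # True True
print(tab_with_s, tab_with_class)          # True False
```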