How to create a regex in Python to get a specific text?

For example, I want to take the string below:

TEXT ABOVE

EXMO. SR. DR. JUÍZ DE DIREITO DO JUIZADO ESPECIAL CÍVEL DA COMARCA DA CAPITAL

TEXT BELOW

The word EXMO. may be replaced by EXCELENTISSIMO, and the word JUIZADO by VARA.

I tried the regex below, but it did not get me very far:

re.findall(r'((?:exmo|excelentissimo))', text, re.IGNORECASE)

Does anyone know how I can capture this whole piece of the string?

1 answer

First of all, you don’t need two sets of parentheses in a row.

Parentheses form a capturing group. That is, whatever they enclose becomes available as a group when the regex finds a match. This is the default behavior of parentheses.

But when you only want to group sub-expressions without capturing them, you use the syntax (?:, which defines a non-capturing group. Basically, it says that this pair of parentheses should not create a group when the regex finds a match.

So writing ((?:...)) is kind of odd. It is a non-capturing group inside a capturing group, which means you said you want to capture something, and that something is "something I don’t want to capture". That is contradictory and unnecessary, so the first step is to eliminate it (and, depending on the case, keep only one of the two, if needed).

Anyway, what’s inside the parentheses is exmo|excelentissimo, which means "the string exmo or the string excelentissimo". This regex takes only this snippet, ignoring the rest of the string.
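As a quick check of the two points above, here is a minimal sketch. The sample line is taken from the question; the variable names nested and simple are only illustrative.

```python
import re

# Sample line from the question.
text = "EXMO. SR. DR. JUÍZ DE DIREITO DO JUIZADO ESPECIAL CÍVEL DA COMARCA DA CAPITAL"

# The original pattern: a non-capturing group nested inside a capturing group.
nested = re.findall(r'((?:exmo|excelentissimo))', text, re.IGNORECASE)

# The same alternation with a single non-capturing group.
simple = re.findall(r'(?:exmo|excelentissimo)', text, re.IGNORECASE)

print(nested)  # ['EXMO']
print(simple)  # ['EXMO']
```

Both versions return only the keyword itself, which is why the rest of the line never shows up in the result.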

To get the whole line, just use a regex that picks up all the text:

import re

text = """
TEXTO ACIMA

EXMO. SR. DR. JUÍZ DE DIREITO DO JUIZADO ESPECIAL CÍVEL DA COMARCA DA CAPITAL

TEXTO ABAIXO

EXCELENTISSIMO. SR. DR. JUÍZ DE DIREITO DA VARA ESPECIAL CÍVEL DA COMARCA DA CAPITAL

MAIS TEXTO"""

results = re.findall(r'^(?:EXMO|EXCELENTISSIMO)\. SR\. DR\. JUÍZ DE DIREITO (?:DO JUIZADO|DA VARA) ESPECIAL CÍVEL DA COMARCA DA CAPITAL$',
                     text, re.IGNORECASE | re.MULTILINE)

for result in results:
    print(result)

Notice that I used (?:EXMO|EXCELENTISSIMO) and (?:DO JUIZADO|DA VARA) for the parts that can be one or the other. I used non-capturing groups (note the (?: syntax) because, according to the documentation of findall, when the regex contains capturing groups, only the groups are returned. Since you want the entire string, I used non-capturing groups.
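To see the difference in what findall returns, here is a small sketch with a shortened version of the pattern (shortened only to keep the lines readable; the names with_capture and without_capture are illustrative):

```python
import re

line = "EXMO. SR. DR. JUÍZ DE DIREITO DA VARA ESPECIAL CÍVEL DA COMARCA DA CAPITAL"

# With capturing groups, findall returns only the groups (one tuple per match).
with_capture = re.findall(
    r'^(EXMO|EXCELENTISSIMO)\. SR\. DR\. JUÍZ DE DIREITO (DO JUIZADO|DA VARA) ESPECIAL',
    line)

# With non-capturing groups, findall returns the whole matched text.
without_capture = re.findall(
    r'^(?:EXMO|EXCELENTISSIMO)\. SR\. DR\. JUÍZ DE DIREITO (?:DO JUIZADO|DA VARA) ESPECIAL',
    line)

print(with_capture)     # [('EXMO', 'DA VARA')]
print(without_capture)  # ['EXMO. SR. DR. JUÍZ DE DIREITO DA VARA ESPECIAL']
```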

The rest of the text, since it does not change, can be written exactly as it appears ("SR. DR. JUÍZ etc."). The only exception is the dot: in a regex it has a special meaning ("any character", except a line break by default), so it must be escaped and written as \.
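A tiny illustration of why the dot needs escaping. The sample string "SR. SRA SR1" is made up for this example and does not come from the question:

```python
import re

sample = "SR. SRA SR1"  # made-up sample, only to show the difference

# Unescaped dot: matches "SR" followed by any character.
loose = re.findall(r'SR.', sample)

# Escaped dot: matches only "SR" followed by a literal dot.
strict = re.findall(r'SR\.', sample)

print(loose)   # ['SR.', 'SRA', 'SR1']
print(strict)  # ['SR.']
```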

Another detail is the anchors ^ and $, which normally mean the beginning and end of the string. Thanks to the MULTILINE flag, they also match at the beginning and end of each line. This ensures that the line contains only what is specified in the regex.
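The effect of the flag can be seen with a shortened pattern and a three-line string (both simplified here just for illustration):

```python
import re

text = "TEXTO ACIMA\nEXMO. SR. DR.\nTEXTO ABAIXO"
pattern = r'^EXMO\. SR\. DR\.$'

# Without MULTILINE, ^ and $ only match at the start and end of the whole string.
no_flag = re.findall(pattern, text)

# With MULTILINE, they also match at the start and end of each line.
with_flag = re.findall(pattern, text, re.MULTILINE)

print(no_flag)    # []
print(with_flag)  # ['EXMO. SR. DR.']
```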

Running the complete example, with the two sample lines in text, prints:

EXMO. SR. DR. JUÍZ DE DIREITO DO JUIZADO ESPECIAL CÍVEL DA COMARCA DA CAPITAL
EXCELENTISSIMO. SR. DR. JUÍZ DE DIREITO DA VARA ESPECIAL CÍVEL DA COMARCA DA CAPITAL

Since you used the IGNORECASE flag, the regex will also match passages in lowercase ("exmo. sr. dr. etc...") and even in mixed case ("Exmo. Sr. Dr ..."). If you want to match uppercase only, remove this flag (but don’t forget to keep MULTILINE):

results = re.findall(r'^(?:EXMO|EXCELENTISSIMO)\. SR\. DR\. JUÍZ DE DIREITO (?:DO JUIZADO|DA VARA) ESPECIAL CÍVEL DA COMARCA DA CAPITAL$',
                     text, re.MULTILINE)
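A short comparison of the two variants, using a shortened pattern and a made-up two-line input (one mixed-case line, one uppercase line):

```python
import re

# Made-up input: the same phrase in mixed case and in uppercase.
text = "Exmo. Sr. Dr. Juíz de Direito\nEXMO. SR. DR. JUÍZ DE DIREITO"
pattern = r'^(?:EXMO|EXCELENTISSIMO)\. SR\. DR\. JUÍZ DE DIREITO$'

# With IGNORECASE, both lines match.
any_case = re.findall(pattern, text, re.IGNORECASE | re.MULTILINE)

# Without it, only the uppercase line matches.
upper_only = re.findall(pattern, text, re.MULTILINE)

print(any_case)    # ['Exmo. Sr. Dr. Juíz de Direito', 'EXMO. SR. DR. JUÍZ DE DIREITO']
print(upper_only)  # ['EXMO. SR. DR. JUÍZ DE DIREITO']
```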

When I was editing your question, I saw that the text was occupying 3 lines:

EXMO. SR. DR. JUÍZ DE DIREITO
DO JUIZADO ESPECIAL
CÍVEL DA COMARCA DA CAPITAL

I don’t know if it really is like that or if it should all be on one line. In any case, we can replace the space after "DIREITO" and after "ESPECIAL" with \s, which matches both a space and a line break:

results = re.findall(r'^(?:EXMO|EXCELENTISSIMO)\. SR\. DR\. JUÍZ DE DIREITO\s(?:DO JUIZADO|DA VARA) ESPECIAL\sCÍVEL DA COMARCA DA CAPITAL$',
                     text, re.MULTILINE)

If you prefer, you can use [ \n] (a space or \n; notice that there is a space after the [), since \s also matches other characters, such as TAB and the others listed in the documentation.
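The difference between the two options can be checked with the three-line text and with a variant in which the line breaks are replaced by TABs. This is only a sketch: the helper build and the tabbed variant are hypothetical, created for this comparison.

```python
import re

# The text spread over three lines, as it appeared in the question.
text = "EXMO. SR. DR. JUÍZ DE DIREITO\nDO JUIZADO ESPECIAL\nCÍVEL DA COMARCA DA CAPITAL"

# Hypothetical variant with TABs instead of line breaks.
tabbed = text.replace("\n", "\t")

def build(sep):
    """Build the pattern using `sep` where a space or a line break may appear."""
    return (r'^(?:EXMO|EXCELENTISSIMO)\. SR\. DR\. JUÍZ DE DIREITO' + sep +
            r'(?:DO JUIZADO|DA VARA) ESPECIAL' + sep +
            r'CÍVEL DA COMARCA DA CAPITAL$')

with_s = build(r'\s')          # any whitespace character
with_class = build(r'[ \n]')   # only a space or a line break

print(re.findall(with_s, text, re.MULTILINE) == [text])        # True
print(re.findall(with_class, text, re.MULTILINE) == [text])    # True
print(re.findall(with_s, tabbed, re.MULTILINE) == [tabbed])    # True: \s accepts TAB
print(re.findall(with_class, tabbed, re.MULTILINE))            # []: [ \n] rejects TAB
```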

  • Got it, my friend. Thanks for the explanations. I have one more question: there are cases where ESPECIAL CÍVEL appears and others with only CÍVEL. How do I handle these cases?

  • @Antoniobrazfinizola If "ESPECIAL" is optional, use (?: ESPECIAL)? CÍVEL (note the space before "ESPECIAL"). The ? after the parentheses makes everything inside them optional.

  • (?: ESPECIAL)? didn’t work, but ?(?: ESPECIAL)?, with an extra '?' in front, did. Thanks!

  • @Antoniobrazfinizola Maybe you added an extra space (or some other detail), because it worked for me: https://ideone.com/UubJ61

  • OK, I’ll check it out. Thanks again for the detailed explanation. You’ve helped a lot.

  • Another question came up. If I want to capture only up to the word CÍVEL, and other things may appear after that word, how can I do it? For example, from "EXMO. SR. DR. JUÍZ DE DIREITO DO JUIZADO ESPECIAL CÍVEL DA COMARCA DA CAPITAL" I want to take only "EXMO. SR. DR. JUÍZ DE DIREITO DO JUIZADO ESPECIAL CÍVEL".

  • @Antoniobrazfinizola Just take out the part that doesn’t matter: https://ideone.com/s1i7IJ

  • @Antoniobrazfinizola Since I’m about to turn off the computer, I suggest that, for any further questions, you post a new question, so that you don’t depend only on me (a new question is more visible to anyone who can answer, because it appears on the main page, unlike comments, which only appear here) :-) I also say this because the idea of the site is to have one specific problem per question, which helps keep the site organized (don’t take this as "laziness to help" on my part) :-)

  • No problem, all right. Thanks again!
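The two follow-up cases discussed in the comments above (optional "ESPECIAL", and stopping at "CÍVEL") can be sketched together as follows. This is not necessarily the code behind the ideone links; it is a minimal sketch, and the second sample line is made up for illustration.

```python
import re

# The first line comes from the question; the second is a made-up variation
# without "ESPECIAL" and with different text after "CÍVEL".
text = """EXMO. SR. DR. JUÍZ DE DIREITO DO JUIZADO ESPECIAL CÍVEL DA COMARCA DA CAPITAL
EXMO. SR. DR. JUÍZ DE DIREITO DA VARA CÍVEL DA COMARCA DE OUTRA CIDADE"""

# (?: ESPECIAL)? makes " ESPECIAL" optional (note the space inside the group).
# The pattern ends at CÍVEL, with no $, so whatever follows is simply ignored.
pattern = (r'^(?:EXMO|EXCELENTISSIMO)\. SR\. DR\. JUÍZ DE DIREITO '
           r'(?:DO JUIZADO|DA VARA)(?: ESPECIAL)? CÍVEL')

results = re.findall(pattern, text, re.MULTILINE)

for result in results:
    print(result)
# EXMO. SR. DR. JUÍZ DE DIREITO DO JUIZADO ESPECIAL CÍVEL
# EXMO. SR. DR. JUÍZ DE DIREITO DA VARA CÍVEL
```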
