Regex in Python to find several possible names

Question

Asked 7 years, 6 months ago

Viewed 747 times

I need to find the judge's name in labor-court case files, but first I need to know whether the title is juiz(a) (judge), relator(a) (rapporteur), juiz(a) relator(a) (judge-rapporteur) or desembargador(a) (appellate judge).

I’m using the following Regex:

f_regex = re.compile(r'ju(iz|íza) relato(r|a) | ju(iz|íza) | relato(r|a) | desembargado(r|ra)')

But it’s not working.

EDIT:

Problem solved. The problem was not in the regex but in a function of mine. Sorry for the inconvenience; I did not know what was happening either. Thanks to everyone who took the time to help me, truly.

Maybe your problem is the r before 'ju(iz|íza)'; it seems to generate a syntax error. I'm not sure, but that's what it looks like.

– Paulo Roberto Rosa

2017/12/20 at 18:05
I have tried several ways. The r is not the problem.

– André

2017/12/20 at 18:32

Have you tried it this way? https://regex101.com/r/FGK6r2/2. Do you have an example of the text we should check the regex against?

– Valdeir Psr

2017/12/20 at 18:33
@André Link broken. Not Found.

– Valdeir Psr

2017/12/21 at 16:25
http://m.uploadedit.com/bbtc/1513873742547.txt

– André

2017/12/21 at 16:30
Try it this way: https://regex101.com/r/FG6r2/3

– Valdeir Psr

2017/12/21 at 16:53
@Valdeirpsr, I posted an answer below. This regex did not work.

– André

2017/12/22 at 17:45

4 answers

by André • 9 points · Answer 1 · 2017-12-22T17:19:56+00:00

Guys, a lot of the problems were because of the spaces.

Folder with some txts: https://drive.google.com/drive/folders/1aqUjO4x3cvmKFYJJZJKV8dMmDxGGBOqq?usp=sharing

The following function catches most cases (juiz, relator, desembargador), but it is not catching juiz relator. Maybe the problem is in the masculine|feminine alternation:

def find_juiz(file):
    file_lines = list(reversed(line_tokenize(file)))[:10]
    file_chunked = str(file_lines)
    name_juiz = ''
    search = re.search(r'ju(iz|íza)\s*relato(r|ra)|ju(iz|íza)|relato(r|ra)|desembargado(r|ra)',file_chunked)
    if search is not None:
        for i,line in enumerate(file_lines):
            if line.strip() in search.group():
                # while file_lines[i+1] is not None:
                #     j=i
                #     name_juiz += file_lines[j+1]
                #     j+=1
                # return name_juiz
                return i,line.strip()
    else:
        return

PS: line_tokenize comes from the nltk package (Natural Language Toolkit), a Python package for NLP (Natural Language Processing). It takes a text and splits it into a list of lines, one line per element. Since the judges' names usually appear at the end of the file, I reversed that list with reversed and took its first 10 items, which are the last 10 lines of the file (list(reversed(line_tokenize(file)))[:10]).
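One thing worth noting in the function above: line.strip() in search.group() tests whether the whole line is contained in the matched text, so a line that holds anything besides the title never passes, and the regex is case-sensitive. A minimal sketch of a line-by-line alternative follows. It assumes each title sits on a single line, swaps line_tokenize for str.splitlines to stay self-contained, and uses the hypothetical names TITLE_RE and find_title; it is not the fix the author eventually applied.

```python
import re

# Hypothetical pattern: longest alternative first, so "juiz relator"
# is preferred over plain "juiz". Case-insensitive by assumption.
TITLE_RE = re.compile(
    r'\b(ju(?:iz|íza)\s+relatora?|ju(?:iz|íza)|relatora?|desembargadora?)\b',
    re.IGNORECASE,
)


def find_title(text, last_n=10):
    """Search the last `last_n` non-empty lines, starting from the end.

    Returns (index_from_end, matched_title, full_line) or None.
    """
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    for i, line in enumerate(reversed(lines[-last_n:])):
        match = TITLE_RE.search(line)
        if match:
            return i, match.group(1), line
    return None
```

Searching each line separately avoids matching against the repr of a list (str(file_lines)), where quotes and commas sit between what used to be adjacent lines.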

by Valdeir Psr • **10,804** points · Answer 2 · 2017-12-22T17:56:18+00:00

Following the text published in the question comment: http://m.uploadedit.com/bbtc/1513873742547.txt

You can capture this information with regex \b(.+)(?:\n.*)(?:relatora?|desembargadora?|ju[íi]za?)

Explanation:

\b(.+) => Here we capture a group starting at a word boundary. Because we use .+, it captures everything up to the line break (see the next item).

(?:\n.*) => Here we match the line break and the content of the next line, up to the title.

(?:relatora?|desembargadora?|ju[íi]za?) => Here we filter for a few words. The ? marks the character before it as optional, so relatora? matches both relator and relatora.

?: => We use this so the group is not captured; we only want it to be matched.
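To see the pieces working together, here is a small sketch that runs this regex on a made-up three-line sample (the sample text is an assumption, not the content of the linked file). Note that the pattern is case-sensitive as written, so a capitalized "Juíza" would need re.IGNORECASE.

```python
import re

pattern = re.compile(r'\b(.+)(?:\n.*)(?:relatora?|desembargadora?|ju[íi]za?)')

# Hypothetical sample: the name sits on the line right before the title.
sample = "Some earlier text.\nFulana de Tal\njuíza relatora\n"

match = pattern.search(sample)
name = match.group(1) if match else None
print(name)  # Fulana de Tal
```

Only the first group is captured; the line break and the title are matched by non-capturing groups, so group(1) holds just the line before the title.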

Regex in operation

by Guilherme Nascimento • **98,651** points · Answer 3 · 2017-12-22T20:00:20+00:00

First problem:

Your regex will not work with .match, because .match only succeeds when the pattern matches starting at the very beginning of the string. Use .search to find it anywhere in the text.
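The difference between .match and .search can be seen in a minimal sketch; the sample sentence is made up for illustration.

```python
import re

pattern = re.compile(r'ju(iz|íza)')
text = "assinado pelo juiz Fulano"  # hypothetical sample sentence

at_start = pattern.match(text)    # only tries position 0 of the string
anywhere = pattern.search(text)   # scans the whole string

print(at_start)          # None
print(anywhere.group())  # juiz
```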

Second problem:

Your .txt file may be in UTF-8, in which case the accented characters may not be recognized. If you are using urllib (in case you read the files remotely), add .decode('utf-8') to the read() call on the response object returned by urllib.

If your document is in ASCII, windows-1252 or iso-8859-1, add the encoding parameter to open().

See the examples at the end of the answer

Third problem:

Your regex only finds titles that have a space both before and after them. Remember that sentences can end with punctuation such as ., !, ?, and words can also be separated by ,, ;, :, or enclosed in quotation marks "

Your regex should be something like:

r'(\s|^)(ju(iz|íza) relato(r|ra)|ju(iz|íza)|relato(r|ra)|desembargado(r|ra))[!?",;:.\s]'

The (\s|^) at the beginning means the word may be preceded by a space, line break or tab, or sit at the very start of the text.
The [!?",;:.\s] means the word may be followed by punctuation, and the \s inside it allows a space, line break or tab after the word.
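A short sketch of how these delimiters behave on a few made-up strings. It writes the alternation as relato(?:r|ra) so that relatora can match, and the names DELIMS, TITLES, consuming, lookahead and title are hypothetical. Because the trailing character class must consume one character, a title at the very end of the text is missed; the second pattern is one possible variation that uses a lookahead accepting end of text as well.

```python
import re

DELIMS = r'[!?",;:.\s]'
TITLES = (r'ju(?:iz|íza) relato(?:r|ra)|ju(?:iz|íza)'
          r'|relato(?:r|ra)|desembargado(?:r|ra)')

# Same idea as the regex above: a delimiter must follow the title.
consuming = re.compile(r'(?:\s|^)(' + TITLES + r')' + DELIMS)
# Variation (an assumption, not from the answer): also accept end of text.
lookahead = re.compile(r'(?:\s|^)(' + TITLES + r')(?=' + DELIMS + r'|$)')


def title(pattern, text):
    """Return the matched title, or None when nothing matches."""
    m = pattern.search(text)
    return m.group(1) if m else None


print(title(consuming, 'o juiz, Fulano'))          # juiz
print(title(consuming, 'juíza relatora: Fulana'))  # juíza relatora
print(title(consuming, 'prejuízo grande'))         # None
print(title(consuming, 'assinado pelo juiz'))      # None
print(title(lookahead, 'assinado pelo juiz'))      # juiz
```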

Example if downloading from a URL

If you are reading from a URL, do it like this:

# -*- coding: utf-8 -*-

import re              # regular expressions
import urllib.request  # HTTP download

url = "http://m.uploadedit.com/bbtc/1513873742547.txt"
# relato(r|ra) so that "relatora" can match as well as "relator"
pattern = r'(\s|^)(ju(iz|íza) relato(r|ra)|ju(iz|íza)|relato(r|ra)|desembargado(r|ra))[!?",;:.\s]'

with urllib.request.urlopen(url) as f:
    # decode the downloaded bytes so accented characters are read correctly
    data = f.read().decode('utf-8')

    p = re.compile(pattern)
    resultado = p.search(data)

    print(resultado)

If the remote file is in windows-1252 or iso-8859-1, use this instead:

data = f.read().decode('latin1')

Example if reading a file on the machine

If the .txt file is in utf-8, use encoding='utf-8'; if it is in windows-1252 or iso-8859-1, use encoding='latin1'.

import re

arquivo = 'foobar.txt'
# relato(r|ra) so that "relatora" can match as well as "relator"
pattern = r'(\s|^)(ju(iz|íza) relato(r|ra)|ju(iz|íza)|relato(r|ra)|desembargado(r|ra))[!?",;:.\s]'

with open(arquivo, encoding='utf-8') as f:
    data = f.read()

    p = re.compile(pattern)
    resultado = p.search(data)

    print(resultado)

by Dhiogo Boza • 1 point · Answer 4 · 2017-12-20T18:54:05+00:00

Hello, I created a version below with small changes in your regex:

import re

# relato(r|ra) so that "relatora" can match as well as "relator"
f_regex = re.compile(r'^(\b|.+\s)ju(iz|íza) |^(\b|.+\s)relato(r|ra) |^(\b|.+\s)desembargado(r|ra) ', re.IGNORECASE)


success = ["juíza Fulana", "Juiz Fulano", "desembargador Fulano", "Desembargadora Fulana", "Sr. Juiz de Tal"]
fail = ["Fulana de Tal", "Fulano de Tal", "Fulano de Tal", "Juízane Fulana de Tal", "Juizo de Tal", "Dajuiz de Tal"]

print("\nShould match:")
for string in success:
    result = f_regex.match(string)
    print(string, '- found?', result is not None)

print("\nShould not match:")
for string in fail:
    result = f_regex.match(string)
    print(string, '- found?', result is not None)

I created some strings to test as well. I hope it helps.