Regex in Python to find several possible names

-1

I need to find the judge's name in a labor court case file, but first I need to know whether the title is Juiz(a) (judge), Relator(a) (rapporteur), Juiz(a) Relator(a) (rapporteur judge) or Desembargador(a) (appellate judge).

I’m using the following Regex:

f_regex = re.compile(r'ju(iz|íza) relato(r|a) | ju(iz|íza) | relato(r|a) | desembargado(r|ra)')

But it’s not working.

EDIT:

Problem solved. The problem was not in the regex but in one of my own functions. Sorry for the inconvenience; I did not know what was happening either. Thank you to everyone who took the time to help me.

  • Maybe your problem is the r before 'ju(iz|íza)'; it seems to generate a syntax error. I'm not sure, but that's what it looks like.

  • I have tried several ways. The r is not the problem.

  • Have you tried it this way? https://regex101.com/r/FGK6r2/2. Do you have an example of the page you want to check with the regex?

  • @André Link broken. Not Found.

  • http://m.uploadedit.com/bbtc/1513873742547.txt

  • Try it this way: https://regex101.com/r/FG6r2/3

  • @Valdeirpsr, I answered below. That regex did not work.

4 answers

0

Guys, a lot of the problems were because of the spaces.
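To illustrate the point about spaces, here is a minimal sketch. It assumes two patterns modeled on the ones in this thread (written with relato(r|ra) so the feminine form can match) and a few made-up sample strings. Inside a regex, a space around | is a literal character, so each alternative only matches when that space is really present in the text.

```python
import re

# Spaces around "|" are literal: " ju(iz|íza) " needs a space on both sides.
with_spaces = re.compile(
    r'ju(iz|íza) relato(r|ra) | ju(iz|íza) | relato(r|ra) | desembargado(r|ra)'
)
# Same alternatives without the stray spaces.
without_spaces = re.compile(
    r'ju(iz|íza) relato(r|ra)|ju(iz|íza)|relato(r|ra)|desembargado(r|ra)'
)

# Made-up samples, not taken from the shared files.
samples = ["juiz", "relatora", "desembargador.", "o juiz disse"]
for s in samples:
    print(repr(s), bool(with_spaces.search(s)), bool(without_spaces.search(s)))
```

Only the last sample, where the word has a space on both sides, is found by the version with spaces.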

Folder with some txts: https://drive.google.com/drive/folders/1aqUjO4x3cvmKFYJJZJKV8dMmDxGGBOqq?usp=sharing

The following function catches most titles (juiz, relator, desembargador), but it is not catching juiz relator. Maybe the problem is in the masculine|feminine alternation:

import re
from nltk.tokenize import line_tokenize

def find_juiz(file):
    # Keep only the last 10 lines, last line first
    file_lines = list(reversed(line_tokenize(file)))[:10]
    file_chunked = str(file_lines)
    name_juiz = ''
    search = re.search(r'ju(iz|íza)\s*relato(r|ra)|ju(iz|íza)|relato(r|ra)|desembargado(r|ra)', file_chunked)
    if search is not None:
        for i, line in enumerate(file_lines):
            if line.strip() in search.group():
                # while file_lines[i+1] is not None:
                #     j=i
                #     name_juiz += file_lines[j+1]
                #     j+=1
                # return name_juiz
                return i, line.strip()
    else:
        return

PS: line_tokenize comes from the nltk (Natural Language Toolkit) package, which is used for NLP (Natural Language Processing) in Python. It takes a text and splits it into a list of lines, one line per position. Since the judges' names usually appear at the end of the file, I reversed that list with reversed and took the last 10 lines, which are now the first: list(reversed(line_tokenize(file)))[:10].
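For readers without nltk installed, the "last 10 lines, reversed" step can be sketched with the standard library. This is only an approximation: str.splitlines is used as a stand-in for line_tokenize, dropping blank lines is a choice made here, and the helper name last_lines_reversed is hypothetical.

```python
def last_lines_reversed(text, n=10):
    """Return the last n non-empty lines of text, last line first."""
    # str.splitlines stands in for nltk's line_tokenize (an assumption).
    lines = [line for line in text.splitlines() if line.strip()]
    return list(reversed(lines))[:n]

sample = "line 1\nline 2\n\nline 3\n"
print(last_lines_reversed(sample, n=2))  # ['line 3', 'line 2']
```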

0

Following the text published in the question comment: http://m.uploadedit.com/bbtc/1513873742547.txt

You can capture this information with the regex \b(.+)(?:\n.*)(?:relatora?|desembargadora?|ju[íi]za?)

Explanation:

\b(.+) => Here we capture a pattern starting at a word boundary. Because we use .+, it captures everything up to the line break (see the next item).

(?:\n.*) => Here we tell the engine to match the line break and the content of the next line.

(?:relatora?|desembargadora?|ju[íi]za?) => Here we filter for a few words. The ? makes the character before it (the final a) optional.

?: => We use this to keep the group from being captured; the group is only matched.
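As a quick check of how this pattern behaves, here is a minimal sketch on a made-up three-line text where the name sits on the line just before the title. The sample is an assumption and is lowercase on purpose, because the pattern is case-sensitive; the real files may be laid out differently.

```python
import re

pattern = re.compile(r'\b(.+)(?:\n.*)(?:relatora?|desembargadora?|ju[íi]za?)')

# Made-up sample: name on one line, title on the next.
sample = "Some closing text.\nFulano de Tal\njuiz relator\n"

found = pattern.search(sample)
# Group 1 is the whole line that precedes the line containing the title.
print(found.group(1))  # Fulano de Tal
```

If the title is on the same line as the name, or in upper case, this pattern finds nothing unless it is adapted (for example with re.IGNORECASE).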

Regex in operation

  • I did not find anything with this regex, Valdeir.

0

First problem:

Your regex will not work with .match, because .match only succeeds when the pattern matches at the very beginning of the string. Use .search to find the pattern anywhere in the text.
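A minimal sketch of the difference, using the simplified pattern from this thread and a made-up sentence:

```python
import re

pattern = re.compile(r'ju(iz|íza)|relato(r|ra)|desembargado(r|ra)')
text = "Assinado por Fulano de Tal, juiz relator"  # made-up sample

# match() only tries at position 0, so it fails here.
print(pattern.match(text))           # None
# search() scans the whole string and returns the first hit.
print(pattern.search(text).group())  # juiz
```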

Second problem:

Also, your .txt file may be in UTF-8, in which case the accents might not be recognized. If you are using urllib (in case you read the files remotely), add .decode('utf-8') to the read() call on the response.

If your document is in ASCII, windows-1252 or iso-8859-1, add the encoding parameter to open().

See the examples at the end of the answer
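To show why the encoding matters for the regex itself, here is a small sketch. It assumes the bytes on disk are UTF-8 and decodes them once correctly and once with the wrong codec; the sample string is made up.

```python
import re

pattern = re.compile(r'ju(iz|íza)')

# Bytes as they might arrive from a file or URL (assumed to be UTF-8).
raw = "juíza relatora".encode("utf-8")

right = raw.decode("utf-8")
wrong = raw.decode("latin1")  # wrong codec: 'í' becomes two other characters

print(pattern.search(right) is not None)  # True
print(pattern.search(wrong) is not None)  # False
```

With the wrong codec the accented letter no longer exists in the text, so a pattern containing í cannot match it.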

Third problem:

Your regex only finds words with a space before and after. Remember that sentences can end in punctuation such as ., !, ?, etc., and words can also be separated by ,, ;, :, or even be enclosed in quotation marks ".

Your regex should be something like:

r'(\s|^)(ju(iz|íza) relato(r|ra)|ju(iz|íza)|relato(r|ra)|desembargado(r|ra))[!?",;:.\s]'

  • The (\s|^) at the beginning indicates that the word may be preceded by a space, line break or tab, or be at the start of the text
  • The [!?",;:.\s] indicates that there may be punctuation at the end of the word, and the \s in it allows a space, line break or tab at the end of the word.
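A minimal sketch of how such a pattern behaves, assuming a version modeled on the one above (written here with non-capturing inner groups and relato(?:r|ra) so both genders can match) and a few made-up strings:

```python
import re

pattern = re.compile(
    r'(?:\s|^)'
    r'(ju(?:iz|íza) relato(?:r|ra)|ju(?:iz|íza)|relato(?:r|ra)|desembargado(?:r|ra))'
    r'[!?",;:.\s]'
)

# Made-up samples, not taken from the shared files.
samples = [
    "juiz relator: Fulano",
    "Assinado pela juíza, Fulana",
    "desembargadora.",
    "prejuizo causado",   # "juiz" inside another word
    "juiz",               # nothing after the word
]
for s in samples:
    found = pattern.search(s)
    print(repr(s), found.group(1) if found else None)
```

Note the last sample: because the trailing character class has to consume one character, a title at the very end of the text with nothing after it is not found.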

Example if downloading from a URL

If you are reading from a URL, do it like this:

# -*- coding: utf-8 -*-

import re             # regular expressions
import urllib.request # HTTP download

url = "http://m.uploadedit.com/bbtc/1513873742547.txt"
pattern = r'(\s|^)(ju(iz|íza) relato(r|ra)|ju(iz|íza)|relato(r|ra)|desembargado(r|ra))[!?",;:.\s]'

with urllib.request.urlopen(url) as f:
    data = f.read().decode('utf-8')

    p = re.compile(pattern)
    resultado = p.search(data)

    print(resultado)

If the remote file is in windows-1252 or iso-8859-1, use this instead:

data = f.read().decode('latin1')

Example if reading a local file

If the .txt file is in utf-8, use encoding='utf-8'; if it is in windows-1252 or iso-8859-1, use encoding='latin1'

import re

arquivo = 'foobar.txt'
pattern = r'(\s|^)(ju(iz|íza) relato(r|ra)|ju(iz|íza)|relato(r|ra)|desembargado(r|ra))[!?",;:.\s]'

with open(arquivo, encoding='utf-8') as f:
    data = f.read()

    p = re.compile(pattern)
    resultado = p.search(data)

    print(resultado)

  • So, William, the regex I'm using actually works. It just does not catch "rapporteur judge" (juiz relator) in the txts that I shared on Google Drive; it catches everything else. Your regex did not work: it only caught some rapporteurs and some judges.

  • @André I ran the test exactly like this and it worked well. Did you understand the part about encoding? It is not that my regex did not work; your encoding is probably not set correctly. Please try again.

  • Yes, I was using encoding from the beginning. I already figured out what the problem was. It wasn't in the regex but in the line if line.strip() in search.group():. The opposite is right: if search.group() in line.strip():. In fact, this simple regex works: r'ju(iz|íza)|relato(r|ra)|desembargado(r|ra)'

  • @André you did not mention this group check in the question, so there was no way anyone could have guessed where the problem was, right?

  • Yes, I really thought it was in the regex. Then I posted an answer here with the function code, which is where the error was. Sorry for the mistake.
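The fix described in these comments can be sketched as follows. This is a hedged reconstruction, not the asker's final code: str.splitlines stands in for nltk's line_tokenize, the lines are joined with "\n" before searching, and the sample text is made up.

```python
import re

TITLE_RE = re.compile(
    r'ju(iz|íza)\s*relato(r|ra)|ju(iz|íza)|relato(r|ra)|desembargado(r|ra)'
)

def find_juiz(text, tail=10):
    """Return (index, line) for the first tail line containing the title, else None."""
    # Last `tail` lines, last line first (splitlines is a stand-in for line_tokenize).
    lines = list(reversed(text.splitlines()))[:tail]
    found = TITLE_RE.search("\n".join(lines))
    if found is None:
        return None
    for i, line in enumerate(lines):
        # Corrected direction: the matched title must be inside the line,
        # not the line inside the matched title.
        if found.group() in line.strip():
            return i, line.strip()
    return None

sample = "Processo n. 0000\nSome closing text.\nFulano de Tal - juiz relator\n"
print(find_juiz(sample))  # (0, 'Fulano de Tal - juiz relator')
```

With the original check (line.strip() in search.group()), this sample would return nothing, because the full line is longer than the matched title and so can never be a substring of it.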

0

Hello, I created a version below with small changes in your regex:

import re

f_regex = re.compile(r'^(\b|.+\s)ju(iz|íza) |^(\b|.+\s)relato(r|ra) |^(\b|.+\s)desembargado(r|ra) ', re.IGNORECASE)


success = ["juíza Fulana", "Juiz Fulano", "desembargador Fulano", "Desembargadora Fulana", "Sr. Juiz de Tal"]
fail = ["Fulana de Tal", "Fulano de Tal", "Fulano de Tal", "Juízane Fulana de Tal", "Juizo de Tal", "Dajuiz de Tal"]

print("\nShould match:")
for string in success:
    result = f_regex.match(string)
    print(string, '- found?', result is not None)

print("\nShould not match:")
for string in fail:
    result = f_regex.match(string)
    print(string, '- found?', result is not None)

I created some strings to test as well. I hope it helps.

  • That did not work either, Dhiogo.

  • Can you share an example of the data that is not working?

  • I answered below.
