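First problem:
Your regex will not work with .match, because .match only succeeds when the regex matches from the very beginning of the string. Use .search instead, which looks for the regex anywhere in the string.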
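The difference can be seen in a minimal sketch. The sample sentence and the shortened pattern are made up for illustration and are not taken from your file:

```python
import re

# Hypothetical sample text; the word we want is not at position 0.
text = "O voto foi proferido pelo relator."
pattern = re.compile(r"relato(r|a)")

# match() only tries to match at the start of the string, so it fails here.
print(pattern.match(text))

# search() scans the whole string and finds the word wherever it is.
print(pattern.search(text).group(0))
```

With .match the pattern would only be found if the text began with the word itself.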
Second problem:
Your .txt file may be in UTF-8, in which case the accented characters may not be recognized. If you are using urllib (perhaps you read the files remotely), add .decode('utf-8') to the result of read() on the object returned by urllib.
If your document is in ASCII, windows-1252 or iso-8859-1, pass the encoding parameter to open().
See the examples at the end of the answer.
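To illustrate why the encoding matters, here is a small sketch with a made-up sample string (it stands in for the bytes that read() would return). The same bytes decoded with the wrong codec no longer contain the accented word, so the regex cannot find it:

```python
import re

# Hypothetical content, encoded the way a UTF-8 file would store it.
raw = "A juíza relatora votou.".encode("utf-8")

right = raw.decode("utf-8")    # correct codec: 'í' is preserved
wrong = raw.decode("latin1")   # wrong codec: 'í' becomes two unrelated characters

pattern = re.compile(r"ju(iz|íza)")
print(pattern.search(right))   # a match object for 'juíza'
print(pattern.search(wrong))   # None
```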
Third problem:
Your regex only finds words that have a space before and after them. Remember that sentences can end with punctuation such as ., !, ? and so on, and that words can also be separated by ,, ; or :, or even be enclosed in quotation marks ".
Your regex should be something like:
r'(\s|^)(ju(iz|íza) relato(r|a)|ju(iz|íza)|relato(r|a)|desembargado(r|ra))[!?",;:.\s]'
- The (\s|^) at the beginning means the word may be preceded by a space, line break or tab, or may sit at the very start of the text.
- The [!?",;:.\s] means the word may be followed by punctuation; the \s inside it also allows a space, line break or tab after the word.
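As a quick check of the pattern above, the sketch below runs it against three made-up sentences (they are illustrative only, not taken from your file) and collects the word that was found in each one:

```python
import re

pattern = re.compile(
    r'(\s|^)(ju(iz|íza) relato(r|a)|ju(iz|íza)|relato(r|a)|desembargado(r|ra))[!?",;:.\s]'
)

# Hypothetical sample sentences.
samples = [
    "O juiz relator, em seu voto, negou o pedido.",
    "Quem decide é a desembargadora!",
    "Não há magistrado citado aqui.",
]

found = []
for s in samples:
    m = pattern.search(s)
    # Group 2 holds the word itself, without the surrounding space or punctuation.
    found.append(m.group(2) if m else None)

print(found)
```

Note that, as written, the pattern expects one of the listed characters after the word, so a word at the very end of the text with nothing after it would not be matched.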
Example: downloading from a URL
If you are reading from a URL, do it like this:
# -*- coding: utf-8 -*-
import re  # regular expressions module
import urllib.request  # module used to download the file

url = "http://m.uploadedit.com/bbtc/1513873742547.txt"
pattern = r'(\s|^)(ju(iz|íza) relato(r|a)|ju(iz|íza)|relato(r|a)|desembargado(r|ra))[!?",;:.\s]'

with urllib.request.urlopen(url) as f:
    # read() returns bytes; decode them so the accented characters are preserved
    data = f.read().decode('utf-8')

p = re.compile(pattern)
resultado = p.search(data)  # search() looks for the pattern anywhere in the text
print(resultado)
If the remote file is in windows-1252 or iso-8859-1, use this instead:
data = f.read().decode('latin1')
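If you do not know in advance which of the two encodings the remote file uses, one possible approach is to try UTF-8 first and fall back to latin1. This is only a sketch: decode_text is a hypothetical helper name, not part of urllib, and the fallback assumes the file is either UTF-8 or a latin1-like encoding:

```python
def decode_text(raw: bytes) -> str:
    """Decode bytes as UTF-8, falling back to latin1 (hypothetical helper)."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # latin1 maps every byte to a character, so this never raises,
        # but the result is only right if the file really is latin1-like.
        return raw.decode("latin1")


# Usage with made-up data in both encodings.
print(decode_text("juíza".encode("utf-8")))
print(decode_text("juíza".encode("latin1")))
```

In the example above it would replace the explicit call, as in data = decode_text(f.read()).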
Example: reading a file on the machine
If the .txt file is in UTF-8, use encoding='utf-8'; if it is in windows-1252 or iso-8859-1, use encoding='latin1'.
import re

arquivo = 'foobar.txt'
pattern = r'(\s|^)(ju(iz|íza) relato(r|a)|ju(iz|íza)|relato(r|a)|desembargado(r|ra))[!?",;:.\s]'

# open() decodes the file using the given encoding
with open(arquivo, encoding='utf-8') as f:
    data = f.read()

p = re.compile(pattern)
resultado = p.search(data)  # search() looks for the pattern anywhere in the text
print(resultado)
Maybe your problem is the r before 'ju(iz|íza); it seems to produce a syntax error. I'm not sure, but that's what it looks like. – Paulo Roberto Rosa
I have tried it several ways. The r is not the problem. – André
Have you tried it this way? https://regex101.com/r/FGK6r2/2. Could you share an example of the page you want to check with the regex? – Valdeir Psr
@André The link is broken: Not Found. – Valdeir Psr
http://m.uploadedit.com/bbtc/1513873742547.txt – André
Try it this way: https://regex101.com/r/FG6r2/3 – Valdeir Psr
@Valdeirpsr, I replied below. This regex did not work. – André