Find a string inside a text file using regex in Python

I would like to know how to find a string in the format ' XXXX ' (surrounded by spaces) inside a text file using regex. I have tried several approaches without success:

import re
f = open('infos', 'r')
padrao = re.findall(r'\sSSBR\s', f)
if padrao in f:
    print(padrao)
else:
    print("Pattern not found!")

When executed, it returns this error:

Traceback (most recent call last):
  File "analiseInfos.py", line 3, in <module>
    padrao = re.findall(r'\sSSBR\s', f)
  File "C:\Users\Matheus\AppData\Local\Programs\Python\Python37\lib\re.py", line 223, in findall
    return _compile(pattern, flags).findall(string)
TypeError: expected string or bytes-like object

Excerpt from the file 'infos':

20190401 00:01:25.371 00000084 SSBR 186701000125370c 0000 0000 D P C S N 0 3 RV 1e->000010413400000000000SSSSBR1010000091107200KP006287927455DF0

20190401 00:01:25.729 000000ff SSBR 175601000125370c 0000 0001 D R S S N 0 3 RV 1e<-000010413400000000000SSBRSS1022300091000107200KP006287927455

20190401 00:01:26.984 00000076 SSBR 176401000125984c 0000 0000 D P C S N 0 3 RV 1e->000011413400000000000SSSSBR1010000091107200CJ003363907455DF0

20190401 00:01:27.700 000000ff SSBR 190401000126984c 0000 0001 D R S S N 0 3 RV 1e<-000011413400000000000SSBRSS1236500091000107200CJ003363907455
  • Post a snippet of what you have inside the infos file

  • All right, buddy!!

1 answer

The open function returns a file object, whereas findall expects to receive a string. That is what the error message is saying:

TypeError: expected string or bytes-like object

You passed the return value of open (namely, a file object) instead of a string.
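A minimal sketch of the failure and the fix. It uses io.StringIO as a stand-in for the opened file (an assumption, made only so the example runs without the 'infos' file on disk), and the sample line is shortened from the excerpt in the question:

```python
import io
import re

# io.StringIO stands in for the object returned by open('infos', 'r')
# (assumption: used here only so the example needs no file on disk).
fake_file = io.StringIO("20190401 00:01:25.371 00000084 SSBR 186701000125370c\n")

try:
    re.findall(r'\sSSBR\s', fake_file)   # passing the file object itself
    raised = False
except TypeError:
    raised = True                        # findall rejects non-string input

# Passing the contents of the file (a string) works.
matches = re.findall(r'\sSSBR\s', fake_file.read())
print(raised, matches)
```

The first call fails for the same reason as in the question; the second succeeds because read() returns a string.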

To check the contents of the file, you must first use the file object to read the contents and get them as a string. Then you pass that string to findall. I also recommend using with, because it closes the file automatically:

import re

with open('infos', 'r') as f:
    for line in f:  # for each line of the file
        print(re.findall(r'\sSSBR\s', line))

Keep in mind that findall returns a list of all matches of the regex in the given string, so printing it shows the results (if nothing matches, it returns an empty list).

The code above loops over every line of the file and applies the regex to each one. If you prefer, you can also read the entire file contents into a single string and then apply the regex:

with open('infos', 'r') as f:  
    tudo = f.read()

print(re.findall(r'\sSSBR\s', tudo))

For very large files, however, loading everything at once can consume a lot of memory, so the first approach, reading one line at a time, is preferable.


Again, findall returns a list of the substrings that matched. Since your regex consists of fixed text (the letters "SSBR", in exactly this order, with a whitespace character before and after), the result will be a list with one or more strings such as " SSBR " (or an empty list if nothing is found).
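As a small illustration of what findall returns in each case, here is a sketch with two sample lines. The first is shortened from the excerpt in the question; the second is a hypothetical variation with "XXXX" in place of "SSBR", invented only to show the empty result:

```python
import re

lines = [
    # shortened from the excerpt in the question
    "20190401 00:01:25.371 00000084 SSBR 186701000125370c 0000 0000",
    # hypothetical line without SSBR, made up for this example
    "20190401 00:01:25.729 000000ff XXXX 175601000125370c 0000 0001",
]

# One findall result per line: a list of matches, or an empty list.
results = [re.findall(r'\sSSBR\s', line) for line in lines]
print(results)
```

The first line yields a one-element list with the surrounding spaces included, and the second yields an empty list.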

If you just want to know whether the line contains "SSBR" or not, you can use search:

with open('infos', 'r') as f:
    for line in f:
        if re.search(r'\sSSBR\s', line):
            print('line contains SSBR')
        else:
            print('line does not contain SSBR')

When you use the same regex many times, it is worth compiling it first with the compile function:

r = re.compile(r'\sSSBR\s')
with open('infos', 'r') as f:
    for line in f:
        if r.search(line):
            print('line contains SSBR')
        else:
            print('line does not contain SSBR')

This way the regex is reused and does not need to be recompiled on every iteration of the loop (although the documentation notes that the most recently used patterns are cached, so for small programs that use only a few regexes it won't make much difference).


Another detail is that you used \s (which matches spaces, tabs, and line breaks; see the documentation for the full list), and whatever it matches is part of what findall returns (that is, it will return " SSBR ", with the spaces before and after). If you want only the string "SSBR" in the results, you can change the regex to r'\s(SSBR)\s': the parentheses form a capturing group, and when groups are present, findall returns only the groups.
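A short sketch comparing the two patterns on a line shortened from the excerpt in the question (the variable names are just for illustration):

```python
import re

# shortened from the excerpt in the question
line = "20190401 00:01:25.371 00000084 SSBR 186701000125370c"

with_spaces = re.findall(r'\sSSBR\s', line)    # whole match, whitespace included
group_only = re.findall(r'\s(SSBR)\s', line)   # only the capturing group

print(with_spaces, group_only)
```

Both patterns match at the same place; the only difference is which part of the match findall puts in the list.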

Or you can use r'\bSSBR\b'. The \b means "word boundary" and matches positions where there is a word character (a letter, digit, or underscore) on one side and a non-word character, or the beginning or end of the string, on the other. That is, it finds the string "SSBR" even if it is surrounded by something other than whitespace (such as punctuation marks or the beginning or end of the string).
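To see how \b behaves on this data, the sketch below uses the first line of the excerpt in the question, plus a short hypothetical string ("SSBR, end") invented only to show a case where \s fails but \b succeeds:

```python
import re

# first line of the excerpt in the question
line = ("20190401 00:01:25.371 00000084 SSBR 186701000125370c 0000 0000 "
        "D P C S N 0 3 RV 1e->000010413400000000000SSSSBR1010000091107200"
        "KP006287927455DF0")

plain = re.findall(r'SSBR', line)        # also finds the SSBR inside "SSSSBR"
bounded = re.findall(r'\bSSBR\b', line)  # only the standalone field

# hypothetical string: SSBR at the start, followed by a comma
sample = "SSBR, end"
with_s = re.findall(r'\sSSBR\s', sample)  # no whitespace before or after SSBR
with_b = re.findall(r'\bSSBR\b', sample)  # boundaries at the start and at the comma

print(len(plain), len(bounded), with_s, with_b)
```

Note that \b does not match the "SSBR" embedded in the long final field, because the characters on both sides of it are letters or digits.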

  • Thank you very much! That’s right!
