Get only the text between brackets

A third-party program generates log records in which each piece of information is enclosed in square brackets.

Example: [informação1] [informação2] [informação3]

The problem is that Python's re library does not let me split this information on the brackets.

For example, if I try to get everything inside the brackets, I get an error, because it cannot cut and separate using special characters such as square brackets, asterisks, and mathematical operators.

In this case, there is no way to change the software so that it stops delivering the information inside the parentheses.

Does anyone have any idea how to get around this problem?

import re

padrao = ''
texto = r"[1232131] testando [teste2] [teste3] e [teste4]"

# split the text into a list according to the pattern
saida = re.split(padrao, texto)
i = 1
for pedaco in saida:
    print("Split {0}: {1}".format(i, pedaco))
    i += 1
  • Your premise is wrong: you can match brackets and other special characters with regex. To do so, use the \ character to escape them.

1 answer

cannot cut and separate using special characters such as square brackets, asterisks, and mathematical operators

Of course you can. But instead of doing a split, I find it easier to capture only the parts you want (in this case, everything inside brackets):

import re
    
texto = r"[1232131] testando [teste2] [teste3] e [teste4]"

for i, pedaco in enumerate(re.findall(r'\[([^]]+)\]', texto), start=1):
    print(f"Split {i}: {pedaco}")

The expression used was \[([^]]+)\]. Breaking it down:

  • It begins with \[ and ends with \], that is, it matches every stretch that begins with a [ and ends with a ]. Brackets have a special meaning in regex, so I need to escape them with \ for them to be treated as ordinary characters.
  • Between the brackets we have [^]], a negated character class, which matches any character other than ]. Then comes the quantifier +, meaning "one or more occurrences".

That is, the regex matches a [, followed by one or more characters that are not ], followed by a ]. In addition, the part that corresponds to the characters that are not ] is in parentheses, which forms a capture group. And when a regex has capture groups, findall returns only the groups.
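To make the effect of the capture group concrete, here is a small sketch comparing the same regex with and without the parentheses. The variable names com_grupo and sem_grupo are mine, chosen only for illustration:

```python
import re

texto = r"[1232131] testando [teste2] [teste3] e [teste4]"

# With a capture group, findall returns only what the group matched
com_grupo = re.findall(r'\[([^]]+)\]', texto)

# Without the group, findall returns each whole match, brackets included
sem_grupo = re.findall(r'\[[^]]+\]', texto)

print(com_grupo)  # ['1232131', 'teste2', 'teste3', 'teste4']
print(sem_grupo)  # ['[1232131]', '[teste2]', '[teste3]', '[teste4]']
```

So the parentheses are what strip the brackets from the result; without them you would have to remove the brackets yourself afterwards.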

I also used enumerate to get the indexes together with the values returned by findall, so you don't have to increment i by hand (and I used an f-string to format the output, available since Python 3.6, but you can keep using format if you prefer). The output will be:

Split 1: 1232131
Split 2: teste2
Split 3: teste3
Split 4: teste4

You can do it with split, but then you would have to split not only on the square brackets but also on all the text between one pair of brackets and the next, which in my opinion is much more complicated. So I found it simpler to define what you want to capture instead of saying how you want to split.

After all, split and match are two sides of the same coin: in the first you say what you do not want (text that is not between brackets) and separate the data by that criterion; in the second you say what you want (text between brackets) and get only that. There are situations where defining one is easier than the other. In this case, split seems harder to me. See:

r = re.compile(r'\][^\[]+\[|[\[\]]')
for i, pedaco in enumerate(filter(lambda s : len(s) > 0, r.split(texto)), start=1):
    print(f"Split {i}: {pedaco}")

The idea is to split on a ] followed by one or more characters that are not [, followed by a [, or on a single bracket (either opening or closing). However, this leaves empty strings in the result (when the separator is at the beginning or end of the string, as explained in the documentation), so I need to remove them with filter.
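To see where the empty strings come from, here is a minimal sketch that prints the raw result of split before filtering. The names partes and limpo are just illustrative, and filter(None, ...) is an alternative I am swapping in for the lambda: it drops every falsy item, which for strings means the empty ones:

```python
import re

texto = r"[1232131] testando [teste2] [teste3] e [teste4]"
r = re.compile(r'\][^\[]+\[|[\[\]]')

# The string starts and ends with a separator, so split produces
# an empty string at each end of the list
partes = r.split(texto)
print(partes)  # ['', '1232131', 'teste2', 'teste3', 'teste4', '']

# filter(None, ...) keeps only the non-empty strings
limpo = list(filter(None, partes))
print(limpo)   # ['1232131', 'teste2', 'teste3', 'teste4']
```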


If you want to be more specific, you can switch to something like re.findall(r'\[([\w]+)\]', texto). In this case, \w is a shortcut for "letters, digits or the character _".

This is a little more restrictive, since [^]] matches any character other than ] (really any character, including punctuation marks, spaces, emojis, line breaks, etc.). If you want to be even more restrictive, just adjust the regex accordingly. There is no way around the trade-off: a simpler regex may end up matching more than you want, while a more restrictive one may end up more complicated. It's up to you to choose what makes more sense for the data you have.
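As a rough illustration of the difference, the sketch below runs both patterns over a made-up line. The string in linha is hypothetical and not taken from the real log format:

```python
import re

# Hypothetical log line, invented only to show the difference
linha = "[erro 42] [usuario_1] [a-b] [ok!]"

# [^]]+ accepts anything except ], so spaces and punctuation are kept
qualquer = re.findall(r'\[([^]]+)\]', linha)

# \w+ accepts only letters, digits and _, so the other items are skipped
so_palavra = re.findall(r'\[(\w+)\]', linha)

print(qualquer)    # ['erro 42', 'usuario_1', 'a-b', 'ok!']
print(so_palavra)  # ['usuario_1']
```

Note that the stricter pattern does not raise an error for the items it rejects; it silently leaves them out, so check which behavior suits your data.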


Regex-free

Another alternative is to skip regex and instead use the find method to locate the brackets, then take the substring between their positions:

def texto_entre_colchetes(texto):
    inicio = 0
    while True:
        inicio = texto.find('[', inicio)
        if inicio == -1:
            break
        fim = texto.find(']', inicio + 1)
        if fim == -1:
            break
        yield texto[inicio + 1: fim]
        inicio = fim + 1

texto = r"[1232131] testando [teste2] [teste3] e [teste4]"

for i, pedaco in enumerate(texto_entre_colchetes(texto), start=1):
    print(f"Split {i}: {pedaco}")

I use the second parameter of find, which is the position where the search begins, so I can continue searching from the last bracket found. When there are no more brackets, find returns -1 and I can end the loop.
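One thing worth checking is how the two approaches behave on malformed input. The sketch below repeats the find-based function so it runs on its own, and uses a hypothetical string (amostra) with an empty pair of brackets and an unclosed bracket at the end:

```python
import re

def texto_entre_colchetes(texto):
    # Same find-based generator, repeated so this snippet is self-contained
    inicio = 0
    while True:
        inicio = texto.find('[', inicio)
        if inicio == -1:
            break
        fim = texto.find(']', inicio + 1)
        if fim == -1:
            break
        yield texto[inicio + 1: fim]
        inicio = fim + 1

# Hypothetical input: an empty pair of brackets and an unclosed bracket
amostra = "[a] [] [b"

com_find = list(texto_entre_colchetes(amostra))
com_regex = re.findall(r'\[([^]]+)\]', amostra)

print(com_find)   # ['a', '']
print(com_regex)  # ['a']
```

Both ignore the unclosed [b, but the find version yields an empty string for [], while the regex skips it because + requires at least one character. If you want the regex to return empty brackets too, you could change + to *.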

To better understand the use of yield above, read here.

  • I wrote parentheses up there, but I meant brackets, haha. Perfect, that solved my problem. Thank you, @hkotsubo.
