cannot cut and separate using special characters such as square brackets, asterisks, and mathematical operators
Of course you can. But instead of doing a split, I find it easier to pick out only the pieces you want (in this case, everything inside brackets):
import re
texto = r"[1232131] testando [teste2] [teste3] e [teste4]"
for i, pedaco in enumerate(re.findall(r'\[([^]]+)\]', texto), start=1):
    print(f"Split {i}: {pedaco}")
The expression used was \[([^]]+)\]. Breaking it down:
- It begins with \[ and ends with \], that is, it matches every stretch that begins with a [ and ends with a ]. Since brackets have a special meaning in regex, I need to escape them with \ so that they are treated as ordinary characters
- Between the brackets we have [^]], which is a negated character class that matches any character other than ]. Then we have the quantifier +, indicating "one or more occurrences"
That is, the regex matches a [, followed by one or more characters that are not ], followed by a ]. In addition, the part that corresponds to the characters that are not ] is in parentheses, which forms a capture group. And when there are capture groups in the regex, findall returns only the groups.
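To see the effect of the capture group, here is a small sketch that runs the same pattern with and without the parentheses (the variable names com_grupo and sem_grupo are only illustrative):

```python
import re

texto = r"[1232131] testando [teste2] [teste3] e [teste4]"

# With a capture group, findall returns only what the group matched
com_grupo = re.findall(r'\[([^]]+)\]', texto)
print(com_grupo)  # ['1232131', 'teste2', 'teste3', 'teste4']

# Without the group, findall returns the whole match, brackets included
sem_grupo = re.findall(r'\[[^]]+\]', texto)
print(sem_grupo)  # ['[1232131]', '[teste2]', '[teste3]', '[teste4]']
```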
I also used enumerate to get the indexes together with the values returned by findall, so you don't have to increment i yourself (and I used an f-string to format the output, available from Python 3.6 onward, but you can keep using format if you prefer). The output will be:
Split 1: 1232131
Split 2: teste2
Split 3: teste3
Split 4: teste4
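For anyone on a version before Python 3.6, or who simply prefers format, the same loop could be sketched like this (collecting the lines in a list called linhas is just my choice for the example, it is not required):

```python
import re

texto = r"[1232131] testando [teste2] [teste3] e [teste4]"

linhas = []
for i, pedaco in enumerate(re.findall(r'\[([^]]+)\]', texto), start=1):
    # str.format fills the {} placeholders in order
    linhas.append("Split {}: {}".format(i, pedaco))

print("\n".join(linhas))
```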
You can do it with split, but then you would have to split not only on the square brackets themselves but also on all the text that sits between one pair of brackets and the next, which in my opinion is much more complicated. That is why I found it simpler to define what you want to pick up instead of saying how you want to separate.
After all, split and match are two sides of the same coin: with the first you say what you do not want (the text that is not between brackets) and separate the data by that criterion; with the second you say what you want (the text between brackets) and get only that. There are situations where defining one is easier than the other, and in this case split seems harder to me. See:
r = re.compile(r'\][^\[]+\[|[\[\]]')
for i, pedaco in enumerate(filter(lambda s : len(s) > 0, r.split(texto)), start=1):
    print(f"Split {i}: {pedaco}")
The idea is to split on a ] followed by one or more characters that are not [, followed by a [, or on a single bracket (opening or closing) by itself. However, this leaves empty strings in the result (when the separator is at the beginning or end of the string, as explained in the documentation), so I need to remove them with filter.
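To make the empty strings visible, this sketch prints the raw result of split before any filtering. The list comprehension is an alternative I am suggesting in place of filter, and the names bruto and filtrado are only illustrative:

```python
import re

texto = r"[1232131] testando [teste2] [teste3] e [teste4]"
r = re.compile(r'\][^\[]+\[|[\[\]]')

# Raw result: the separators at the start and end leave empty strings behind
bruto = r.split(texto)
print(bruto)  # ['', '1232131', 'teste2', 'teste3', 'teste4', '']

# An empty string is falsy, so "if s" drops it
filtrado = [s for s in bruto if s]
print(filtrado)  # ['1232131', 'teste2', 'teste3', 'teste4']
```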
If you want to be more specific, you can switch to something like re.findall(r'\[([\w]+)\]', texto). In this case, \w is a shortcut for "letters, digits or the character _".
This version is a little more restrictive, since [^]] matches any character other than ] (truly any, including punctuation marks, spaces, emojis, line breaks, etc.). If you want to be even more restrictive, just adjust the regex accordingly (there is no way around the trade-off: a simpler regex may end up matching more than you want, while a more restrictive one may end up more complicated; it's up to you to choose what makes the most sense for the data you have).
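The difference shows up as soon as the content between brackets has spaces or punctuation. A quick sketch, using a made-up string (exemplo) that is not part of the original question:

```python
import re

# Hypothetical input with a space and a hyphen inside the brackets
exemplo = "[teste 2] [teste3] [a-b]"

# [^]]+ accepts anything except the closing bracket
tudo = re.findall(r'\[([^]]+)\]', exemplo)
print(tudo)  # ['teste 2', 'teste3', 'a-b']

# \w+ accepts only letters, digits and _, so two of the items are skipped
so_palavras = re.findall(r'\[(\w+)\]', exemplo)
print(so_palavras)  # ['teste3']
```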
Regex-free
Another alternative is not to use regex at all, and instead use the find method to look for the brackets and then take the substring between their positions:
def texto_entre_colchetes(texto):
    inicio = 0
    while True:
        # look for the next opening bracket, starting at the current position
        inicio = texto.find('[', inicio)
        if inicio == -1:
            break
        # look for the closing bracket that comes after it
        fim = texto.find(']', inicio + 1)
        if fim == -1:
            break
        yield texto[inicio + 1: fim]
        # continue the search after the closing bracket
        inicio = fim + 1

texto = r"[1232131] testando [teste2] [teste3] e [teste4]"
for i, pedaco in enumerate(texto_entre_colchetes(texto), start=1):
    print(f"Split {i}: {pedaco}")
I use the second parameter of find, which is the position where the search starts, so I can search from the last bracket found. When there are no more brackets, find returns -1 and I can end the loop.
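A tiny sketch of how the second parameter of find behaves, using a short made-up string so the positions are easy to check by hand:

```python
texto = "a[b]c[d]"

# Without a start position, the search begins at index 0
primeiro = texto.find('[')
print(primeiro)  # 1

# Starting right after the previous hit finds the next bracket
segundo = texto.find('[', primeiro + 1)
print(segundo)  # 5

# No more opening brackets after that, so find returns -1
terceiro = texto.find('[', segundo + 1)
print(terceiro)  # -1
```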
About the use of yield above, read here to understand it better.
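In short, a function that contains yield becomes a generator: it hands back one value at a time and resumes where it stopped when the next value is requested. A minimal sketch, with a hypothetical function (pares_ate) that has nothing to do with the bracket problem:

```python
def pares_ate(limite):
    """Hypothetical generator: yields the even numbers below limite."""
    n = 0
    while n < limite:
        yield n  # pauses here until the next value is requested
        n += 2

gerador = pares_ate(7)

# next() runs the function only up to the first yield
primeiro = next(gerador)
print(primeiro)  # 0

# list() consumes whatever is left
restante = list(gerador)
print(restante)  # [2, 4, 6]
```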
Your information is wrong: you can match brackets and other special characters with regex. To do so, use the \ character to escape them. – Augusto Vasques