In a regex, brackets define a character class: what is inside them is a list of possible characters. So [ ,2020().!] means "a space, or a comma, or the digit 2, or the digit 0, or the digit 2 (again, so repeating it is redundant), etc.". The key detail is that the whole expression matches only a single character (any one of those listed inside the brackets).
This means that each of these characters is treated as a separator. If there is a space, it splits the string and creates an element in the final result. If a 2 follows, another split is made; if there is a period, another one, and so on.
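To see this "one character per match" behaviour in isolation, here is a minimal sketch. The sample string amostra and the variable names are made up for illustration and are not from the question.

```python
import re

# Hypothetical sample string, chosen only to illustrate the point.
amostra = "a2020b"

# Repeating a character inside the brackets changes nothing:
# both classes mean "either a 2 or a 0", one character at a time.
com_repeticao = re.split("[2020]", amostra)
sem_repeticao = re.split("[20]", amostra)

print(com_repeticao)                   # ['a', '', '', '', 'b']
print(com_repeticao == sem_repeticao)  # True
```

Each of the four digits is a separator on its own, so the three gaps between them show up as empty strings.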
So the result of split is quite different from what you need: when the text contains 2020, the regex sees a 2 followed by a 0 and splits between them (and since there is nothing between them, the list fills up with empty strings). That is, if you do this:
print(re.split("[ ,2020().!]", texto))
You will see that the resulting list is:
['\nEm', '', '', '', '', '', 'observamos', '', 'e', 'catalogamos', '', 'com', 'fotografias', '', '', 'os', 'barcos', 'que', 'chegaram', 'ao', 'Porto', '', '\nAté', 'breve', '\n']
And since your function only removes the spaces (the strings equal to ' ', which are different from an empty string such as the various '' seen above), your list still contains several empty strings, plus line breaks (the \n that you see in some places above). That's why the output ends up with so many blank lines.
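The difference between ' ' and '' can be checked directly. The sketch below uses a shortened, hypothetical version of the text and only illustrates why the filter has no effect; it is not meant as the recommended fix.

```python
import re

# Shortened, hypothetical version of the text from the question.
texto = "\nEm 2020 observamos.\n"
partes = re.split("[ ,2020().!]", texto)

# Filtering out ' ' removes nothing: no element is exactly one space.
so_espacos = [p for p in partes if p != ' ']

# Stripping whitespace and dropping what becomes empty does clean the list.
limpas = [p.strip() for p in partes if p.strip() != '']

print(partes)                # ['\nEm', '', '', '', '', '', 'observamos', '\n']
print(so_espacos == partes)  # True
print(limpas)                # ['Em', 'observamos']
```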
If the idea is to get only words (where a "word" is a sequence of letters), you can do this:
import re

def gerarPalavras(texto):
    # split on runs of non-letters and drop the empty strings left at the edges
    return [s for s in re.split(r'[\W\d]+', texto) if s != '']

texto = """
Em 2020 observamos, e catalogamos (com fotografias), os barcos que chegaram ao Porto!
Até breve.
"""
for p in gerarPalavras(texto):
    print(p)
The output is:
Em
observamos
e
catalogamos
com
fotografias
os
barcos
que
chegaram
ao
Porto
Até
breve
The expression [\W\d]+ matches one or more occurrences (indicated by the quantifier +) of [\W\d]. The shortcut \W means "any non-alphanumeric character" (i.e., anything other than a letter, a digit or _) and \d means "a digit". Thus, split breaks the text at every run of characters that are not letters (strictly speaking, _ is also kept, since neither \W nor \d matches it). This also eliminates the line breaks.
I just had to remove the empty strings that end up at the beginning and end of the list.
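A small sketch of what split returns before the filter is applied, which shows where those empty strings come from. The sample strings are hypothetical.

```python
import re

# Hypothetical samples, not from the question.
amostra = "\nEm 2020 observamos.\n"

# The text starts and ends with separators, so split yields an
# empty string at each edge of the list.
bruto = re.split(r'[\W\d]+', amostra)
print(bruto)  # ['', 'Em', 'observamos', '']

# The underscore is matched by neither \W nor \d, so it is not a separator.
com_underscore = re.split(r'[\W\d]+', "foo_bar 42 baz")
print(com_underscore)  # ['foo_bar', 'baz']
```

Note that " 2020 " is consumed as a single separator because of the + quantifier, so no empty strings appear in the middle of the list.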
Match instead of split
Another option is to do the opposite: instead of saying what you don't want (characters that aren't letters) and doing a split, you can say what you do want (that is, what the regex should find in each match) and use findall:
def gerarPalavras(texto):
    return re.findall(r'[^\W\d]+', texto)
Now I use a negated character class: the [^ indicates that what's inside the brackets are things I don't want. In this case, \W (anything that is not alphanumeric) and \d (a digit). Excluding these characters leaves only the letters (plus _). The result is the same as with the previous code; after all, split and match are two sides of the same coin: with split I say what I don't want, and with match/find/search I say what I want.
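As a quick check that the two approaches agree on the sample text, here is a sketch that runs both side by side. The function names palavras_split and palavras_findall are hypothetical, introduced only to keep the two versions apart.

```python
import re

def palavras_split(texto):
    # say what is NOT wanted, split on it, discard the empty strings
    return [s for s in re.split(r'[\W\d]+', texto) if s != '']

def palavras_findall(texto):
    # say what IS wanted and collect every match
    return re.findall(r'[^\W\d]+', texto)

texto = """
Em 2020 observamos, e catalogamos (com fotografias), os barcos que chegaram ao Porto!
Até breve.
"""

resultado_split = palavras_split(texto)
resultado_findall = palavras_findall(texto)

print(resultado_split == resultado_findall)  # True
print(resultado_findall[:3])                 # ['Em', 'observamos', 'e']
```

The findall version needs no filtering step, since a match of one or more characters can never be empty.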
If the texts are restricted to our alphabet, another option is:
re.findall('[a-záéíóúãõâêîôûç]+', texto, re.I)
Here I list all the possible letters, and the flag re.I makes the match case-insensitive, so both upper and lower case are considered.
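A minimal sketch of this variant on made-up sample strings (the name LETRAS and the samples are assumptions for illustration). It also shows the trade-off of an explicit list: any letter left out of the brackets behaves as a separator.

```python
import re

# Same character class as above, stored in a hypothetical constant.
LETRAS = '[a-záéíóúãõâêîôûç]+'

# re.I lets the lower-case letters in the class match upper case too,
# including the accented ones.
amostra = "Até BREVE, São João, ÁGUA 2020"
encontradas = re.findall(LETRAS, amostra, re.I)
print(encontradas)  # ['Até', 'BREVE', 'São', 'João', 'ÁGUA']

# 'à' is not in the list, so it is skipped like any other separator.
com_crase = re.findall(LETRAS, "às vezes", re.I)
print(com_crase)  # ['s', 'vezes']
```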
Of course, this uses a simple definition of "word", since it does not take compound words into account (such as "beija-flor" and "olho-d'água"), but if you want to go deeper into this, you can take a look here, here and here.
Very cool! (as usual...)
– JJoao