How to use re.split() to pick only the words of a text, ignoring numbers and punctuation marks

3

I am given the following text:

texto = """
Em 2020 observamos, e catalogamos (com fotografias), os barcos que chegaram ao Porto! 
Até breve.
"""
for p in gerarPalavras(texto):
 print(p)

I wrote the following code to split the text into words and print each one on its own line:

import re

def gerarPalavras(texto):
    fraseFinal=[]
    frase = re.split("[ ,2020().!]" ,texto)

    for i in frase:
        if i != ' ':
            fraseFinal.append(i)
            
    return fraseFinal

The desired output is one word per line, with punctuation and numbers removed.

My output is close to that, but it also contains blank lines, and I don't understand where they come from:

Em





observamos

e
catalogamos

com
fotografias


os
barcos
que
chegaram
ao
Porto


Até
breve

2 answers

4


In a regex, square brackets define a character class: what is inside them is a list of possible characters. So [ ,2020().!] means "a space, or a comma, or the digit 2, or the digit 0, or the digit 2 (again, which is redundant), and so on". The key detail is that the whole expression matches exactly one character (any one of those listed inside the brackets).

This means that each of these characters, on its own, is treated as a separator. A single space causes a split and produces an element in the result. If it is followed by a 2, another split happens; if a period comes next, another one; and so on.

So the result of split is quite different from what you need: when the text contains 2020, the regex sees a 2 followed by a 0 and splits between them, and since there is nothing between them, the list fills up with empty strings. That is, if you do this:

print(re.split("[ ,2020().!]", texto))

You will see that the resulting list is:

['\nEm', '', '', '', '', '', 'observamos', '', 'e', 'catalogamos', '', 'com', 'fotografias', '', '', 'os', 'barcos', 'que', 'chegaram', 'ao', 'Porto', '', '\nAté', 'breve', '\n']

And since your function only discards strings equal to ' ' (a single space, which is not the same as the empty string '' seen several times above), the list keeps all the empty strings, as well as the line breaks (the \n visible in some elements above). That is why the output ends up with so many blank lines.
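A quick way to check both points is to compare the original class with one that lists each character only once, and to see which filter actually removes the leftovers. This is only a minimal sketch on a shortened sentence; the name `partes` is illustrative.

```python
import re

texto = "Em 2020 observamos, e catalogamos"

# Repeating a character inside [...] changes nothing: both classes
# match exactly one character taken from the same set.
partes = re.split("[ ,2020().!]", texto)
assert partes == re.split("[ ,20().!]", texto)

print(partes)
# ['Em', '', '', '', '', '', 'observamos', '', 'e', 'catalogamos']

# Comparing against ' ' removes nothing: the leftovers are empty strings.
print([p for p in partes if p != ' '] == partes)  # True

# Filtering on truthiness drops the empty strings.
print([p for p in partes if p])
# ['Em', 'observamos', 'e', 'catalogamos']
```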


If the idea is to keep only words (where a "word" is a sequence of letters), you can do this:

import re

def gerarPalavras(texto):
    return [ s for s in re.split(r'[\W\d]+', texto) if s != '' ]

texto = """
Em 2020 observamos, e catalogamos (com fotografias), os barcos que chegaram ao Porto! 
Até breve.
"""
for p in gerarPalavras(texto):
    print(p)

The output is:

Em
observamos
e
catalogamos
com
fotografias
os
barcos
que
chegaram
ao
Porto
Até
breve

The expression [\W\d]+ matches one or more occurrences (indicated by the + quantifier) of [\W\d]. The shortcut \W means "any non-alphanumeric character" (that is, anything other than a letter, a digit or _), and \d means "a digit". So split breaks the text at every run of characters that are not letters (strictly speaking, _ is not a separator either, since \W does not match it). This also takes care of the line breaks.

I only had to filter out the empty strings that appear at the beginning and end of the list.
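To see where those empty strings come from, and how the underscore behaves, here is a small sketch on a shortened text. The names `bruto` and `nome_do_barco` are illustrative, and adding `_` to the class is an assumption about what you might want, not part of the solution above.

```python
import re

texto = "\nEm 2020 observamos, e catalogamos.\n"

# Raw split: the text starts and ends with separator characters,
# so the list starts and ends with an empty string.
bruto = re.split(r'[\W\d]+', texto)
print(bruto)  # ['', 'Em', 'observamos', 'e', 'catalogamos', '']

# The underscore counts as a word character, so it is not a separator here...
print(re.split(r'[\W\d]+', "nome_do_barco"))   # ['nome_do_barco']

# ...unless it is added to the class explicitly.
print(re.split(r'[\W\d_]+', "nome_do_barco"))  # ['nome', 'do', 'barco']
```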


Match instead of split

Another option is to do the opposite: instead of describing what you don't want (characters that aren't letters) and splitting on it, you describe what you do want (what the regex should find in each match) and use findall:

def gerarPalavras(texto):
    return re.findall(r'[^\W\d]+', texto)

Here I use a negated character class: [^ indicates that what is inside the brackets is what I don't want. In this case that is \W (anything that is not alphanumeric or _) or \d (a digit). Excluding those characters leaves only the letters (and _). The result is the same as with the previous code; after all, split and match are two sides of the same coin: with split I say what I don't want, and with match/find/search I say what I do want.
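The equivalence can be checked directly. In this sketch the two function names are hypothetical, chosen only to tell the variants apart; note that findall never produces empty strings, so it needs no filtering step.

```python
import re

def palavras_split(texto):
    # Split on what we do NOT want, then drop the empty strings.
    return [s for s in re.split(r'[\W\d]+', texto) if s]

def palavras_findall(texto):
    # Match what we DO want.
    return re.findall(r'[^\W\d]+', texto)

texto = """
Em 2020 observamos, e catalogamos (com fotografias), os barcos que chegaram ao Porto!
Até breve.
"""

assert palavras_split(texto) == palavras_findall(texto)

# Digits inside a token also act as separators in both versions.
print(palavras_split("abc123def"))    # ['abc', 'def']
print(palavras_findall("abc123def"))  # ['abc', 'def']
```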


If the texts are restricted to the Portuguese alphabet, another option is:

re.findall('[a-záéíóúãõâêîôûç]+', texto, re.I)

Here I list the allowed letters explicitly, and the re.I flag makes the match case-insensitive, so both upper and lower case are accepted.
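One thing to keep in mind with an explicit list is that any letter left out of it behaves as a separator. The sketch below illustrates this with "à", which is not in the class above; the names `LETRAS` and `LETRAS_EXT` are hypothetical, and the extended class is only an assumption about which letters your texts need.

```python
import re

LETRAS = '[a-záéíóúãõâêîôûç]+'  # the class used above

# re.I also covers the accented letters: 'É' matches 'é'.
print(re.findall(LETRAS, "ATÉ Breve", re.I))  # ['ATÉ', 'Breve']

# 'à' is not listed, so it cuts the word.
print(re.findall(LETRAS, "às vezes", re.I))   # ['s', 'vezes']

# Extending the class (assumption: 'à' should count as a letter).
LETRAS_EXT = '[a-záéíóúàãõâêîôûç]+'
print(re.findall(LETRAS_EXT, "às vezes", re.I))  # ['às', 'vezes']
```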


Of course, this uses a simplified definition of "word", since it does not take into account hyphenated compound words (such as "beija-flor" and "olho-d'água" in Portuguese), but if you want to go deeper into that, you can take a look here, here and here.
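As a rough illustration of how hyphenated compounds could be kept together, one possibility is to allow a hyphen only between two runs of letters. This is a sketch under that assumption, not a complete definition of "word" (it does not handle apostrophes, for instance); the name `PALAVRA` is hypothetical.

```python
import re

# One or more letters, optionally followed by "-letters" groups.
# [^\W\d_] means "word character that is neither a digit nor _", i.e. a letter.
PALAVRA = re.compile(r"[^\W\d_]+(?:-[^\W\d_]+)*")

texto = "O beija-flor chegou em 2020 - ao Porto!"

# The isolated hyphen is ignored; the one inside the compound is kept.
print(PALAVRA.findall(texto))
# ['O', 'beija-flor', 'chegou', 'em', 'ao', 'Porto']
```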

  • 1

    Very cool! (as usual...)

3

In language processing, it is customary to start with tokenization (splitting the text into its basic elements: words, punctuation marks, etc.), which is a broader problem than the one asked about.

The example below is a simplified version of the tokenizer I normally use. It shows the re.X mode, which makes regular expressions easier to read, adjust and document.

import re

texto = """Em 2020 observamos, e catalogamos (com fotografias), os barcos ... ao Porto!
Até breve."""

print(re.findall(r'''
     \b\w[\w\-.]*\w\b      # words: 2020 barcos  ver-se dir-se-ia  file.txt
   | \w                    # single-character words: e  a  o
   | \.\.\.                # ...
   | [,.:;?!()[\]]         # punctuation
   | \S                    # any other non-whitespace character
       ''',texto,re.X))

As expected, the output is:

['Em', '2020', 'observamos', ',', 'e', 'catalogamos', '(', 'com', 'fotografias', ')', 
',', 'os', 'barcos', '...', 'ao', 'Porto', '!', 'Até', 'breve', '.']
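To get from these tokens back to what the question asks for (only the words), one option is to filter the token list afterwards. This is a sketch, assuming that "word" means a token made only of letters; `TOKEN`, `tokens` and `palavras` are illustrative names, and hyphenated tokens such as dir-se-ia would be dropped by this particular filter.

```python
import re

# Same pattern as above, compiled once in verbose mode.
TOKEN = re.compile(r'''
     \b\w[\w\-.]*\w\b      # multi-character words and numbers
   | \w                    # single-character words
   | \.\.\.                # ellipsis
   | [,.:;?!()[\]]         # punctuation
   | \S                    # any other non-whitespace character
''', re.X)

texto = """Em 2020 observamos, e catalogamos (com fotografias), os barcos ... ao Porto!
Até breve."""

tokens = TOKEN.findall(texto)

# str.isalpha() is True only when every character is a letter,
# so numbers and punctuation tokens are discarded.
palavras = [t for t in tokens if t.isalpha()]
print(palavras)
# ['Em', 'observamos', 'e', 'catalogamos', 'com', 'fotografias',
#  'os', 'barcos', 'ao', 'Porto', 'Até', 'breve']
```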
