Python - Splitting words delimited by whitespace or square brackets

Viewed 819 times

1

I have a string containing several words. Some words are separated by spaces, but others are compound names enclosed in square brackets.

Example:

string = "Goiânia Vitória Brasília [Campo Grande] Fortaleza [São Paulo] Manaus"

I need to split these words and return them as a list.

Expected output:

"Goiânia"

"Vitória"

"Brasília"

"Campo Grande"

"Fortaleza"

"São Paulo"

"Manaus"

How do I write a regular expression that does this in Python?

  • 3

    William, watch out for places where the apostrophe is part of the name. For example, Santa Bárbara d'Oeste.

  • William, since you edited the question, why not take the opportunity to answer José's comment about cities with compound names and an apostrophe? What should happen in that case, since you would have something like "'Santa Bárbara d'Oeste'"? Wouldn't you want to get the full name of the city? Should the quote inside the name be handled in some way?

  • Thanks José and Anderson. I think the ideal would be to stop using apostrophes and instead use braces or brackets as the delimiter for compound words, giving something like this. Example: string="Goiânia Vitória Brasília [Campo Grande] Fortaleza [São Paulo] Manaus". I will update the question!

  • I added [Santa Bárbara d'Oeste] to the example in the answer. See if this regex helps you: \[(.*?)\]|(\S+)

1 answer

2


Well, the idea is basically to work with capture groups.

The first step is to identify the pattern in the data so you can build the right regex.

Based on the information provided, I identified the following pattern:

\[(.*?)\]|(\S+)

In short, it matches either any text between [ and ], or (|) any run of non-whitespace characters.

You can test the regex in real time on Blush.

For each match, this regex returns the names between [] in group 1 and the other words in group 2.
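As a small sketch of how the two groups behave, re.findall returns one (group 1, group 2) tuple per match, with an empty string for whichever group did not take part. The sample string here is shortened from the question purely for illustration.

```python
import re

# Same pattern as above, written as a raw string.
pattern = re.compile(r'\[(.*?)\]|(\S+)')

# Shortened sample (an assumption for illustration, not the full input).
sample = "Goiânia [Campo Grande] Manaus"

# With two capture groups, findall yields (group1, group2) tuples.
pairs = pattern.findall(sample)
print(pairs)
# [('', 'Goiânia'), ('Campo Grande', ''), ('', 'Manaus')]
```

Only one side of each tuple is filled in, which is why the code has to check which group matched.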

In Python 3, it would look something like:

import re

text = """Goiânia Vitória Brasília [Campo Grande] Fortaleza [São Paulo] Manaus [Santa Bárbara d'Oeste]"""
# Raw string, so the backslashes reach the regex engine unchanged
regex = re.compile(r'\[(.*?)\]|(\S+)')
for match in regex.finditer(text):
    # Group 1 is None when the match is a plain word (group 2)
    if match.group(1) is None:
        print(match.group(2))
    else:
        print(match.group(1))

See working on Ideone
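Since the question asks for a list rather than printed lines, here is a minimal sketch of the same approach collected into a list. The helper name split_names and the constant NAME_PATTERN are hypothetical, chosen only for this example.

```python
import re

# Hypothetical names; the pattern itself is the one from the answer.
NAME_PATTERN = re.compile(r'\[(.*?)\]|(\S+)')

def split_names(text):
    """Return bracketed names (without brackets) and plain words, in order."""
    # Exactly one of the two groups takes part in each match; pick that one.
    return [m.group(1) if m.group(1) is not None else m.group(2)
            for m in NAME_PATTERN.finditer(text)]

cities = split_names(
    "Goiânia Vitória Brasília [Campo Grande] Fortaleza [São Paulo] Manaus"
)
print(cities)
# ['Goiânia', 'Vitória', 'Brasília', 'Campo Grande', 'Fortaleza', 'São Paulo', 'Manaus']
```

Note that this sketch does nothing special for an opening bracket that is never closed: a token such as "[Campo" would come back as a plain word with the bracket still attached.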
