Search for disordered word sequence from a list in text

Question

Asked 7 years, 9 months ago

Viewed 462 times

4

Is there a way to take an unordered list of words and check whether a sequence of them appears in a text?

Example:

lista = ["dia", "noite", "tarde", "é", "está", "bonito", "o", "a", "muito", "feio"]

texto = "Hoje é sábado, vamos sair pois o dia está bonito. Até mais tarde."

Match -> "o dia está bonito" ("the day is beautiful")

I can find all the words from the list, but they don't come out in the order they appear in the text:

lista = ["dia", "noite", "tarde", "é", "está", "bonito", "o", "a", "muito", "feio"]
texto = "Hoje é sábado, vamos sair pois o dia está bonito. Até mais tarde."

frase = []
for palavras in lista:
    if palavras in texto:
        frase.append(palavras)

print (' '.join(frase))

Output:

dia tarde é está bonito o a

Even the "a" is showing up I don’t know why!

2

You need to strip the punctuation from your input; otherwise, in this case, "bonito." will not be matched: https://repl.it/Mp7s . Is that what you want?

– Miguel

2017/10/18 at 13:27
1

@Miguel Thanks, I thought there would be a way to extract "the day is beautiful" which is a sequence with words from the list.

– pitanga

2017/10/18 at 13:34
2

Pitanga... Good exercise (; https://repl.it/Mp7s/4 . This way we find all the sequences (of more than one word) in a text.

– Miguel

2017/10/18 at 14:29
2

Wow!!! I’m speechless @Miguel I’m going to comment on all this code you made. And then I’m going to try to redo it myself. That’s wonderful, thank you!

– pitanga

2017/10/18 at 15:06
2

You’re welcome... good luck

– Miguel

2017/10/18 at 15:26

1 answer

by Isac • **24,736** points · Answer 1 · 2017-10-18T12:54:20+00:00

Even the "a" is showing up I don’t know why!

As written, the code goes through every word in lista and checks whether it occurs anywhere in the text. It does not have to occur as a standalone word; appearing inside another word is enough, and that is why the "a" shows up:

texto = "Hoje é sábado, vamos sair pois o dia está bonito. Até mais tarde."
# the 'a' is here -^---

In this case, Python's in operator checks whether the text contains the letter a as a substring.
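As a minimal sketch of that difference, using the same texto (the variable names substring_match, palavras_do_texto and word_match are only illustrative): on a string, in is a substring test, while on a list of words it compares whole elements.

```python
texto = "Hoje é sábado, vamos sair pois o dia está bonito. Até mais tarde."

# On a string, `in` is a substring test: the letter "a" occurs inside "sábado".
substring_match = "a" in texto

# On a list, `in` compares whole elements, so only a standalone "a" would match.
palavras_do_texto = texto.split()
word_match = "a" in palavras_do_texto

print(substring_match)  # True
print(word_match)       # False
```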

To reach your goal, just invert the logic of the for loop: go through the text word by word and check whether each word is in the list. This not only solves the problem with the "a" but also guarantees that the words come out in the order they appear in the text:

lista = ["dia", "noite", "tarde", "é", "está", "bonito", "o", "a", "muito", "feio"]
texto = "Hoje é sábado, vamos sair pois o dia está bonito. Até mais tarde."

frase = []
for palavras in texto.split(' '):  # now iterate over the text; split(' ') breaks it into words
    if palavras in lista:  # for each word, check whether it is in the list
        frase.append(palavras)

print (' '.join(frase))

See the example on Ideone

Note that splitting on spaces leaves punctuation such as . and , attached to the words. You end up with tokens like bonito. or tarde., which are not in the list, so the code does not find them.
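To make the effect visible, here is a small sketch that prints the tokens produced by split(' ') and the words that actually match (tokens and encontradas are names chosen just for this illustration):

```python
lista = ["dia", "noite", "tarde", "é", "está", "bonito", "o", "a", "muito", "feio"]
texto = "Hoje é sábado, vamos sair pois o dia está bonito. Até mais tarde."

# Splitting on spaces keeps the punctuation glued to the neighbouring word.
tokens = texto.split(' ')
print(tokens)

# "bonito." and "tarde." are not equal to "bonito" and "tarde", so they are skipped.
encontradas = [p for p in tokens if p in lista]
print(' '.join(encontradas))  # é o dia está
```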

You can get around this problem in many ways. One of the simplest is to remove these punctuation marks before analyzing the text:

texto2 = texto.replace('.', '').replace(',', '')

See on Ideone how it looks with this preprocessing step

You can even do something more generic: define the set of punctuation characters to remove and strip them with a custom function:

def retirar(texto, caracteres):
    # remove every character listed in `caracteres` from the text
    for c in caracteres:
        texto = texto.replace(c, '')

    return texto

And now use this function over the original text:

texto2 = retirar(texto, ".,")
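Putting the pieces together, one possible way to extract only the runs of consecutive list words (the "o dia está bonito" from the question) is to group neighbouring words by whether they belong to the list. This is a sketch under assumptions, not the code from the comments above: sequencias and its minimo parameter are hypothetical names introduced here, and itertools.groupby is just one convenient way to form the runs.

```python
from itertools import groupby


def retirar(texto, caracteres):
    # Remove every character listed in `caracteres` from the text.
    for c in caracteres:
        texto = texto.replace(c, '')
    return texto


def sequencias(texto, lista, minimo=2):
    """Hypothetical helper: runs of at least `minimo` consecutive list words."""
    palavras = retirar(texto, ".,").split()
    permitidas = set(lista)  # set membership is faster than scanning the list
    resultado = []
    # groupby puts neighbouring words with the same membership flag in one group.
    for pertence, grupo in groupby(palavras, key=lambda p: p in permitidas):
        grupo = list(grupo)
        if pertence and len(grupo) >= minimo:
            resultado.append(' '.join(grupo))
    return resultado


lista = ["dia", "noite", "tarde", "é", "está", "bonito", "o", "a", "muito", "feio"]
texto = "Hoje é sábado, vamos sair pois o dia está bonito. Até mais tarde."

print(sequencias(texto, lista))  # ['o dia está bonito']
```

Because the punctuation is removed before grouping, a run could in principle continue across a sentence boundary; if that matters, split the text into sentences first and apply the function to each one.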