Search a text for a sequence of words from an unordered list

Viewed 462 times

4

Is there a way to take an unordered list of words and check whether a sequence of them appears in a text?

Example:

lista = ["dia", "noite", "tarde", "é", "está", "bonito", "o", "a", "muito", "feio"]

texto = "Hoje é sábado, vamos sair pois o dia está bonito. Até mais tarde."

Match -> "o dia está bonito" ("the day is beautiful")

I can find all the words from the list, but they do not come out in the order in which they appear in the text:

lista = ["dia", "noite", "tarde", "é", "está", "bonito", "o", "a", "muito", "feio"]
texto = "Hoje é sábado, vamos sair pois o dia está bonito. Até mais tarde."

frase = []
for palavras in lista:
    if palavras in texto:
        frase.append(palavras)

print (' '.join(frase))

Output:

dia tarde é está bonito o a

Even the "a" is showing up, and I don’t know why!

  • 2

    You need to remove the punctuation from your input; otherwise, in this case, "bonito" will not be included: https://repl.it/Mp7s . Is that what you want?

  • 1

    @Miguel Thanks, I thought there would be a way to extract "o dia está bonito", which is a sequence of words from the list.

  • 2

    Pitanga... Good exercise (; https://repl.it/Mp7s/4 . This way we find all the sequences (of more than one word) in a text

  • 2

    Wow!!! I’m speechless @Miguel I’m going to comment on all this code you made. And then I’m going to try to redo it myself. That’s wonderful, thank you!

  • 2

    You’re welcome... good luck

1 answer

4


Even the "a" is showing up, and I don’t know why!

As written, the code goes through every word in lista and checks whether it occurs in the text. It does not have to occur as a separate word; occurring inside another word is enough, and that is why the "a" appears:

texto = "Hoje é sábado, vamos sair pois o dia está bonito. Até mais tarde."
# the 'a' is here--^---

Applied to a string, Python's in operator checks whether the text contains the given substring, here "a".
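
A quick way to see the difference, as a minimal sketch using the text from the question: on a string, in tests for a substring, while on a list of words it tests for a whole element. The variable names achou_substring and achou_palavra are only illustrative.

```python
texto = "Hoje é sábado, vamos sair pois o dia está bonito. Até mais tarde."

# On a string, `in` is a substring test: the "a" inside "sábado" counts.
achou_substring = "a" in texto

# On a list, `in` compares whole elements, so only a standalone "a" counts.
achou_palavra = "a" in texto.split()

print(achou_substring, achou_palavra)  # True False
```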

To reach your goal, just reverse the logic of the for: go through the text word by word and check whether each word is in the list. This not only solves the problem with the "a" but also guarantees the order you want:

lista = ["dia", "noite", "tarde", "é", "está", "bonito", "o", "a", "muito", "feio"]
texto = "Hoje é sábado, vamos sair pois o dia está bonito. Até mais tarde."

frase = []
for palavras in texto.split(' '):  # now iterate over the text; split(' ') breaks it into words
    if palavras in lista:  # check whether each word of the text is in the list
        frase.append(palavras)

print (' '.join(frase))

See the example in Ideone

Note that splitting on spaces leaves characters such as . and , attached to the words, producing tokens like bonito. or tarde. that the code then fails to find in the list.
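
To make this concrete, here is a small sketch, using the same list and text, that collects the tokens which fail the lookup only because of trailing punctuation. The name perdidas and the rstrip(".,") check are assumptions made for this illustration, not part of the code above.

```python
lista = ["dia", "noite", "tarde", "é", "está", "bonito", "o", "a", "muito", "feio"]
texto = "Hoje é sábado, vamos sair pois o dia está bonito. Até mais tarde."

tokens = texto.split(' ')

# Tokens that are not in the list as they stand, but would be
# if the trailing "." or "," were stripped off.
perdidas = [t for t in tokens if t not in lista and t.rstrip(".,") in lista]

print(perdidas)  # ['bonito.', 'tarde.']
```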

You can get around this problem in many ways. One of the simplest is to remove these punctuation marks before analyzing the text:

texto2 = texto.replace('.', '').replace(',', '')

See on Ideone how it looks with this preprocessing step

You can even do something more generic: create a list of punctuation marks to remove and strip them with a custom function:

def retirar(texto, caracteres):
    # remove every character listed in caracteres from texto
    for c in caracteres:
        texto = texto.replace(c, '')

    return texto

And now use this function over the original text:

texto2 = retirar(texto, ".,")

See also this example in Ideone
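
The comments on the question ask for the run of consecutive list words ("o dia está bonito") rather than every matching word. One possible way to get that, offered as a sketch and not necessarily how the repl.it code linked in the comments does it, is to group consecutive words by whether they are in the list, using itertools.groupby, and keep only the groups with more than one word. The function name sequencias, its minimo parameter, and the use of re.findall(r"\w+", ...) to drop punctuation are assumptions made for this illustration.

```python
import re
from itertools import groupby

lista = ["dia", "noite", "tarde", "é", "está", "bonito", "o", "a", "muito", "feio"]
texto = "Hoje é sábado, vamos sair pois o dia está bonito. Até mais tarde."

def sequencias(texto, lista, minimo=2):
    """Return runs of at least `minimo` consecutive words of texto that are all in lista."""
    palavras = re.findall(r"\w+", texto)  # words only, punctuation dropped
    permitidas = set(lista)               # set gives fast membership tests
    resultado = []
    # groupby yields consecutive words that share the same "is in the list" flag
    for na_lista, grupo in groupby(palavras, key=lambda p: p in permitidas):
        grupo = list(grupo)
        if na_lista and len(grupo) >= minimo:
            resultado.append(' '.join(grupo))
    return resultado

print(sequencias(texto, lista))  # ['o dia está bonito']
```

Because the punctuation is discarded before grouping, a run can cross a sentence boundary; if that matters, the text could be split into sentences first and each one processed separately.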

  • I understood the explanation about the "a"! Thank you! But in this case the output of the program is "é o dia está": the "é" is not followed by any word from the list, and moreover "bonito" was not recognized. I thought there would be a way to extract "o dia está bonito", which is a sequence of words from the list. Thank you!

  • 1

    The problem has to do with the , and the . that stay attached to the words when the text is split into words. You can solve this in several ways; the simplest is to remove these separators before processing the words.
