How to remove unwanted words from a text?


3

I am trying to remove unwanted words from a text, but my code also strips those letters out of the other words. For example:

remover_palavras = ["a", "e", "o"]

The program returns: btt (for "batata") and mns (for "menos").

What should I do?

  • Can you elaborate on your problem? Do you want to remove the letters "a", "e" and "o" only when they are alone? By the way, edit the question and add your code.

  • Be careful with this kind of substitution; it does not generalize well.

  • @Andersoncarloswoss, yes, only when they stand alone. I want to give my program a list of stop words, such as: of, if, him, you, that, etc.

  • Can you give an example of an input and the output you expect?

3 answers

5

If this is just a simple exercise, you can write an algorithm that does the following:

  1. Create a list of words to be removed from the text.

  2. Create a list (lista_frase) where each element is a word from the original sentence.

  3. Create a second list (result) by selecting the items of the first list (lista_frase) that are not in the list of words to remove (remover_palavras).

  4. Join all elements of the resulting list, separated by a space.

Code example:

frase = 'Oi, eu sou Goku e estou indo para a minha casa'

remover_palavras  = ['a', 'e']
lista_frase = frase.split()

result = [palavra for palavra in lista_frase if palavra.lower() not in remover_palavras]

retorno = ' '.join(result)
print(retorno)

The output will be

Oi, eu sou Goku estou indo para minha casa

See working on repl.it
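One limitation worth keeping in mind: split() leaves punctuation attached to the words, so a stop word followed by a comma (such as "e,") would not match the list. Below is a minimal sketch of one way around that, assuming it is acceptable to ignore leading and trailing punctuation when comparing. The helper name remover_stopwords is hypothetical and not part of the answer above.

```python
import string

def remover_stopwords(frase, remover_palavras):
    # Hypothetical helper: drop every token whose lowercased form,
    # with surrounding punctuation stripped, is in the stop word list.
    stopwords = {p.lower() for p in remover_palavras}
    resultado = []
    for token in frase.split():
        nucleo = token.strip(string.punctuation).lower()
        if nucleo not in stopwords:
            resultado.append(token)
    return ' '.join(resultado)

print(remover_stopwords('Oi, eu sou Goku e, estou indo para a minha casa.', ['a', 'e']))
```

Note that a removed token takes its attached punctuation with it, which may or may not be what you want.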

0

For me the best way would be with Regular Expressions:

import re

text = 'Oi, eu sou Goku e estou indo para a minha casa'
palavras = ['a','e']

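for palavra in palavras:
    # Match the word only when it is preceded by whitespace and followed by
    # whitespace, a comma, or a period; the following character is kept.
    # re.escape guards against words containing regex metacharacters.
    text = re.sub(r'\s' + re.escape(palavra) + r'([\s,.])', r'\1', text)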

print(text)

I think it makes sense to keep any punctuation that follows a removed word, but that depends on what you need.
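A variant of the same idea, sketched under the assumption that the stop words are plain words: the \b word-boundary anchor lets a single compiled pattern handle all the words in one pass, including a stop word at the very start or end of the text. The function name remover_com_regex is hypothetical.

```python
import re

def remover_com_regex(texto, palavras):
    # Hypothetical helper: remove whole-word occurrences of each stop word.
    if not palavras:
        return texto  # an empty alternation would match at every boundary
    alternativas = '|'.join(re.escape(p) for p in palavras)
    # \b ... \b matches only complete words; \s* also consumes the
    # whitespace right after the word so no double spaces are left behind.
    padrao = re.compile(r'\b(?:' + alternativas + r')\b\s*', flags=re.IGNORECASE)
    return padrao.sub('', texto).strip()

print(remover_com_regex('Oi, eu sou Goku e estou indo para a minha casa', ['a', 'e']))
```

Because \b treats a hyphen as a boundary, the "a" in a hyphenated form like "a-b" would also be removed, and a stop word directly before a comma leaves a space in front of that comma. Adjust the pattern if your text has such cases.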

-1

I’m a beginner in Python, but here is a function that solves your problem.

Function:

def remover_palavra(palavra, remover):
    remover_tamanho = len(remover)
    palavra_tamanho = len(palavra)
    while True:
        remover_posicao = palavra.find(remover)
        if remover_posicao != -1:
            palavra_inicio = palavra[0:remover_posicao]
            palavra_fim = palavra[remover_posicao+remover_tamanho:palavra_tamanho]
            palavra = palavra_inicio + palavra_fim
        else:
            break
    return palavra

Testing:

palavras = ["batata", "menos"]
palavras_para_remover = ["a", "e", "o"]
for palavra in palavras:
    resultado = palavra
    for remover in palavras_para_remover:
        resultado = remover_palavra(resultado, remover)
    print(resultado)

Output:

btt
mns
  • 1

    This is exactly the result the question is trying to avoid. Note that you have removed letters from the words, not whole words. That is not what was asked. If the phrase is "the potato", the output should be only "potato", not "btt".

  • I think I understand what you mean. I really had misunderstood the question. The code can still be made to work (as I said, I’m just starting with Python): using palavras = ["the potato"] and palavras_para_remover = ["the "] works, but the ideal code would really check the spaces between words...

  • 1

    See the jbueno solution above. It is very simple and does what you ask. It will be useful to study it.
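To make the point from these comments concrete, here is a small sketch contrasting the two approaches. The sample sentence is an assumption chosen for illustration. Removing the substring "a " also eats the final "a" of any word that is followed by a space, while comparing whole tokens does not.

```python
frase = 'a batata e a casa'

# Substring removal: deletes every "a " it finds, including word endings.
por_substring = frase.replace('a ', '')

# Token comparison: drops only tokens that are exactly a stop word.
por_token = ' '.join(p for p in frase.split() if p not in {'a', 'e'})

print(por_substring)  # batate casa
print(por_token)      # batata casa
```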
