Problem filtering stopwords from words with accents


Good morning. I’m trying to develop a simple Python algorithm to remove stop words from texts, but I’m having problems with words that have accents.

The code is as follows:

import io
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from unicodedata import normalize
import sys

reload(sys)
sys.setdefaultencoding('utf8')

stop_words = set(stopwords.words('portuguese'))
file1 = open(r"C:\Users\Desktop\Teste.txt")  # raw string so the backslashes are not treated as escapes
print("Arquivo lido!")
line = file1.read()
palavras = line.split()
# Convert the words to lowercase
palavras = [palavra.lower() for palavra in palavras]
print("Rodando!")
for r in palavras:
    if r not in stop_words:
        appendFile = open('textofiltrado.txt', 'a')
        appendFile.writelines(" " + r)
        appendFile.close()

print("Concluido!")

When running the code with the following test file:

E É Á A O Ó U Ú

I get this output:

 É Á Ó Ú

That is, it does not recognize words that have accents. Setting setdefaultencoding to utf-8 did not work. Does anyone know a solution to this problem?
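
For reference, the mismatch can be reproduced in isolation. This is a minimal snippet, assuming Python 2 and NLTK 3, where the stopword list holds unicode strings while a file read with plain open() yields UTF-8 byte strings:

# -*- coding: utf-8 -*-
from nltk.corpus import stopwords

stop_words = set(stopwords.words('portuguese'))

byte_word = '\xc3\xa9'  # the UTF-8 bytes for "é", as read from the file
print(byte_word in stop_words)                  # False: byte strings never match unicode entries
print(byte_word.decode('utf-8') in stop_words)  # True once decoded to unicode

Note also that .lower() on a byte string only lowercases ASCII letters, so "É" stays "É" until it is decoded.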

1 answer


Use

palavra.decode('utf-8').lower()

Source: here

  • When using this solution, a problem appears when writing to the output file (in appendFile.writelines)

  • Call .encode('utf-8') on the string at the end, before writing it. Or else ask a new question.
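
Putting the answer and this comment together, a minimal sketch of the corrected loop could look like this (a sketch only, assuming Python 2 as in the question; the file paths are the question’s own):

# -*- coding: utf-8 -*-
from nltk.corpus import stopwords

stop_words = set(stopwords.words('portuguese'))

with open(r"C:\Users\Desktop\Teste.txt") as f:
    palavras = f.read().split()

with open('textofiltrado.txt', 'a') as appendFile:
    for palavra in palavras:
        # Decode first so .lower() also lowercases accented letters,
        # and so the membership test compares unicode against unicode
        p = palavra.decode('utf-8').lower()
        if p not in stop_words:
            # Encode back to UTF-8 bytes before writing, as suggested above
            appendFile.write(' ' + p.encode('utf-8'))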
