Good morning, I’m trying to develop a simple Python algorithm for removing stop words from texts, but I’m having problems with words that have accents.
The code is as follows:
import io
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from unicodedata import normalize
import sys
reload(sys)
sys.setdefaultencoding('utf8')
stop_words = set(stopwords.words('portuguese'))
file1 = open(r"C:\Users\Desktop\Teste.txt")  # raw string so the backslashes are not treated as escapes
print("Arquivo lido!")
line = file1.read()
palavras = line.split()
# Convert the words to lower case
palavras = [palavra.lower() for palavra in palavras]
print("Rodando!")
for r in palavras:
    if r not in stop_words:
        appendFile = open('textofiltrado.txt', 'a')
        appendFile.writelines(" " + r)
        appendFile.close()
print("Concluido!")
When I run the code with the following test file:
E É Á A O Ó U Ú
I get this output:
É Á Ó Ú
That is, it does not recognize accented words. Setting setdefaultencoding to utf-8 did not work. Does anyone know a solution I can use to solve this problem?
When I use this solution, it causes a problem when writing to the output file (in appendFile.writelines).
– Gabriel Naslaniec
Add
.encode('utf-8')
at the end, man. Otherwise, ask a new question. – Marcelo Shiniti Uchimura
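A minimal Python 3 sketch of the idea behind these comments: read and write text with an explicit UTF-8 encoding so accented words are proper Unicode strings and compare correctly against the stop-word set, with no setdefaultencoding hack. The stop-word set below is a small stand-in for NLTK's Portuguese list, and the filtering is shown on an in-memory string; for files, the same applies with open(path, encoding="utf-8").

```python
# -*- coding: utf-8 -*-
# Stand-in stop-word set; the question uses stopwords.words('portuguese') from NLTK.
stop_words = {"e", "é", "a", "á", "o", "ó", "u", "ú"}

def filtrar(texto):
    """Lower-case the words of `texto` and drop those found in stop_words."""
    return [p for p in texto.lower().split() if p not in stop_words]

# With proper Unicode strings, the accented words are matched and removed too.
print(" ".join(filtrar("E É Á A O Ó U Ú")))  # prints an empty line: all are stop words
```

For file input and output, `open("Teste.txt", encoding="utf-8")` and `open("textofiltrado.txt", "w", encoding="utf-8")` would replace the default-encoding `open` calls in the question; in Python 2, the equivalent is `io.open` with the same `encoding` argument plus `.encode('utf-8')` before writing byte strings.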