Good morning, I’m trying to develop a simple Python algorithm for removing stop words from texts, but I’m having problems with words that have accents.
The code is as follows:
import io
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from unicodedata import normalize
import sys
reload(sys)
sys.setdefaultencoding('utf8')
stop_words = set(stopwords.words('portuguese'))
file1 = open(r"C:\Users\Desktop\Teste.txt")  # raw string so the backslashes are not treated as escapes
print("Arquivo lido!")
line = file1.read()
palavras = line.split()
# Convert the words to lower case
palavras = [palavra.lower() for palavra in palavras]
print("Rodando!")
for r in palavras:
    if r not in stop_words:
        appendFile = open('textofiltrado.txt', 'a')
        appendFile.writelines(" " + r)
        appendFile.close()
print("Concluido!")
When I run the code with the following test file:
E É Á A O Ó U Ú
I get this output:
É Á Ó Ú
That is, it does not recognize accented words. Setting setdefaultencoding to utf-8 did not work. Does anyone know a solution I can use to solve this problem?
When I use this solution, it causes a problem when writing to the output file (in appendFile.writelines).
– Gabriel Naslaniec
Add
.encode('utf-8')
at the end, man. Otherwise, ask a new question. – Marcelo Shiniti Uchimura
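A minimal Python 3 sketch of the idea behind these comments: read and write text with an explicit UTF-8 encoding so accented words are proper Unicode strings and compare correctly against the stop-word set, with no setdefaultencoding hack. The stop-word set below is a small stand-in for NLTK's Portuguese list, and the filtering is shown on an in-memory string; for files, the same applies with open(path, encoding="utf-8").

```python
# -*- coding: utf-8 -*-
# Stand-in stop-word set; the question uses stopwords.words('portuguese') from NLTK.
stop_words = {"e", "é", "a", "á", "o", "ó", "u", "ú"}

def filtrar(texto):
    """Lower-case the words of `texto` and drop those found in stop_words."""
    return [p for p in texto.lower().split() if p not in stop_words]

# With proper Unicode strings, the accented words are matched and removed too.
print(" ".join(filtrar("E É Á A O Ó U Ú")))  # prints an empty line: all are stop words
```

For file input and output, `open("Teste.txt", encoding="utf-8")` and `open("textofiltrado.txt", "w", encoding="utf-8")` would replace the default-encoding `open` calls in the question; in Python 2, the equivalent is `io.open` with the same `encoding` argument plus `.encode('utf-8')` before writing byte strings.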