from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from string import punctuation
import csv

texto = open('arquivo_sujo.csv', 'r').read()
with open('arquivo_limpo.csv', 'w') as csvfile:
    palavras = word_tokenize(texto.lower())
    stopwords = set(stopwords.words('portuguese') + list(punctuation))
    palavras_sem_stopwords = [palavra for palavra in palavras if palavra not in stopwords]
    escrita = csv.writer(csvfile, delimiter=' ')
    escrita.writerows(palavras_sem_stopwords)
Replacing writerows with writerow solved the problem.
But how do I make the new file keep the same format as the original?
Each line contains a single word instead of the full sentence.
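To illustrate the writerow/writerows difference mentioned above, here is a minimal sketch that uses only the standard library. The word list and the in-memory buffers are assumptions made for the demonstration and are not taken from the original files.

```python
import csv
import io

words = ["develop", "work"]  # sample data, assumed for illustration

# writerows expects an iterable of rows. Each string is treated as a row,
# so every character of the word becomes its own field.
rows_buffer = io.StringIO()
csv.writer(rows_buffer, delimiter=' ').writerows(words)
writerows_output = rows_buffer.getvalue()

# writerow treats the whole list as a single row: one field per word.
row_buffer = io.StringIO()
csv.writer(row_buffer, delimiter=' ').writerow(words)
writerow_output = row_buffer.getvalue()

print(repr(writerows_output))
print(repr(writerow_output))
```

With writerows, each word is split into characters across its own line; with writerow, all words end up together on one line.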
Have you tried changing writerows to writerow?
– Woss
Thanks! But I have another problem now.
– N.Peterson
The old file has the following sentence on its first line: Develop and work. The new file looks like this: Develop [empty line] work
– N.Peterson
And how do you want it to look?
– Woss
One sentence per line.
– N.Peterson
But doesn't the word_tokenize function return a flat list of words, regardless of where one sentence ends and the next begins?
– Woss
Yes, that's true. Is there a way to add a delimiter that marks the end of each sentence?
– N.Peterson
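One possible approach, sketched here under the assumption that the input file already holds one sentence per line, is to tokenize and filter each line separately and call writerow once per line, so no extra end-of-sentence delimiter is needed. The helper names (`simple_tokenize`, `clean_lines`, `write_clean`) are hypothetical, and the regex tokenizer is only a stand-in so the sketch runs without NLTK.

```python
import csv
import io
import re


def simple_tokenize(line):
    # Stand-in tokenizer (an assumption for this sketch): returns runs of
    # word characters and individual punctuation marks.
    return re.findall(r"\w+|[^\w\s]", line)


def clean_lines(lines, ignored, tokenize=simple_tokenize):
    """Yield one list of kept tokens per input line."""
    for line in lines:
        tokens = tokenize(line.lower())
        yield [token for token in tokens if token not in ignored]


def write_clean(lines, target, ignored, tokenize=simple_tokenize):
    """Write each cleaned line as one space-delimited row."""
    writer = csv.writer(target, delimiter=' ')
    for tokens in clean_lines(lines, ignored, tokenize):
        writer.writerow(tokens)  # one row per original line


# Demonstration with in-memory data and a tiny, made-up stopword set.
source = io.StringIO("Develop and work.\nRead and write!\n")
target = io.StringIO()
write_clean(source, target, ignored={"and", ".", "!"})
result = target.getvalue()
print(result)
```

To use this with the code from the question, `word_tokenize` could be passed as `tokenize` and the set built from `stopwords.words('portuguese')` plus `punctuation` as `ignored`, with the two CSV files opened in place of the in-memory buffers. If the sentences are not already on separate lines, `sent_tokenize` from `nltk.tokenize` might be used to split the text first.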