I need to read a .csv file and rewrite it into another .csv file without stopwords using Python

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from string import punctuation
import csv


texto = open('arquivo_sujo.csv','r').read()

with open('arquivo_limpo.csv', 'w') as csvfile:
    palavras = word_tokenize(texto.lower())

    stopwords = set(stopwords.words('portuguese') + list(punctuation))
    palavras_sem_stopwords = [palavra for palavra in palavras if palavra not in stopwords]


    escrita = csv.writer(csvfile, delimiter=' ')
    escrita.writerows(palavras_sem_stopwords)

Changing writerow to writerows solved the original problem.
But how do I get the new file to keep the same format as the old one? Each line now holds a single word instead of the full sentence.

  • Have you tried changing writerow to writerows?

  • Thanks! But I have another problem now.

  • The old file has the following phrase in the first line: Develop and work. The new file looks like this: Develop [empty line] work

  • And how do you want it to look?

  • One sentence per line

  • But doesn't word_tokenize return a list of words, regardless of whether the input is a sentence or not?

  • Yes, that's true. Is there a way to add a delimiter that marks the end of each sentence?

1 answer

I believe the main problem is that you read and tokenize the entire file at once, whereas you want to process it sentence by sentence. To fix this, read each line of the input file separately:

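from string import punctuation
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Build the set once; a separate name avoids shadowing the nltk stopwords module.
stop_words = set(stopwords.words('portuguese') + list(punctuation))

with open('arquivo_sujo.csv') as stream_input, open('arquivo_limpo.csv', 'w') as stream_output:
    for phrase in stream_input:
        # Tokenize one line at a time so each output line matches an input line.
        words = word_tokenize(phrase.lower())
        without_stopwords = [word for word in words if word not in stop_words]
        stream_output.write(' '.join(without_stopwords) + '\n')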

This way, each line of the input file is processed separately and written to the output file without its stopwords. Since the output format is simple, I see no need for the csv module; join handles it well.
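
The same line-by-line idea can be tried without downloading the NLTK data. The sketch below is only an illustration under stated assumptions: `remove_stopwords`, `SAMPLE_STOPWORDS`, and the regex tokenizer are hypothetical stand-ins (none of them come from NLTK), and in-memory `io.StringIO` streams take the place of the real files.

```python
import io
import re
from string import punctuation

# Hypothetical stand-in for stopwords.words('portuguese'): a tiny hand-picked sample.
SAMPLE_STOPWORDS = {"e", "o", "a", "de", "para", "que"}


def remove_stopwords(stream_input, stream_output, stop_words):
    """Copy stream_input to stream_output line by line, dropping stop words."""
    # Treat punctuation marks as tokens to drop, as in the answer above.
    ignored = set(stop_words) | set(punctuation)
    for phrase in stream_input:
        # Simplified tokenizer (assumption): runs of word characters,
        # or single non-space symbols. It is not equivalent to word_tokenize.
        tokens = re.findall(r"\w+|[^\w\s]", phrase.lower())
        kept = [token for token in tokens if token not in ignored]
        # One output line per input line keeps the original layout.
        stream_output.write(" ".join(kept) + "\n")


source = io.StringIO("Desenvolver e trabalhar.\nO gato e o cachorro\n")
target = io.StringIO()
remove_stopwords(source, target, SAMPLE_STOPWORDS)
result = target.getvalue()
print(result, end="")
```

With real files, the two `StringIO` objects would be replaced by the handles from `open(...)`, and the sample set by the NLTK stopword list.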

  • I hadn’t tried join yet!
