I need to read a .csv file and rewrite it into another .csv file without stopwords using Python

Question

Asked 6 years ago

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from string import punctuation
import csv


texto = open('arquivo_sujo.csv','r').read()

with open('arquivo_limpo.csv', 'w') as csvfile:
    palavras = word_tokenize(texto.lower())

    stopwords = set(stopwords.words('portuguese') + list(punctuation))
    palavras_sem_stopwords = [palavra for palavra in palavras if palavra not in stopwords]

    escrita = csv.writer(csvfile, delimiter=' ')
    escrita.writerows(palavras_sem_stopwords)

Changing writerow to writerows solved the original problem.
But how do I get the new file to keep the same format? Each line contains a single word instead of the full sentence.

Have you tried replacing writerow with writerows?

– Woss

2018/06/15 at 22:31
Thanks! But I have another problem now.

– N.Peterson

2018/06/15 at 22:37
The old file has the following sentence on its first line: Develop and work. The new file looks like this: Develop (empty line) work

– N.Peterson

2018/06/15 at 22:39
And how do you want it to look?

– Woss

2018/06/15 at 22:45
One sentence per line

– N.Peterson

2018/06/15 at 22:47
But doesn't the function word_tokenize return a list of words, regardless of whether the input is a sentence or not?

– Woss

2018/06/15 at 22:48
Yes, that's true. Is there a way to add a delimiter that marks the end of each sentence?

– N.Peterson

2018/06/15 at 22:55

1 answer

by Woss • **73,416** points · Answer 1 · 2018-06-15T22:55:09+00:00

I believe the main problem is that you are reading and tokenizing the entire contents of the file at once, when you want to process it sentence by sentence. To fix this, read each line of the input file separately:

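from string import punctuation
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Stopwords plus punctuation, under a new name so the nltk module is not shadowed
ignored = set(stopwords.words('portuguese') + list(punctuation))

with open('arquivo_sujo.csv') as stream_input, open('arquivo_limpo.csv', 'w') as stream_output:
    for phrase in stream_input:
        # Tokenize one line at a time so each sentence stays on its own line
        words = word_tokenize(phrase.lower())
        without_stopwords = [word for word in words if word not in ignored]
        stream_output.write(' '.join(without_stopwords) + '\n')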

In this case, each line of the input file is processed separately and written to the output file without the stopwords. Since the output format is simple, I see no need to use the csv module; join already handles it well.
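
To try the line-by-line idea without installing NLTK or touching real files, here is a minimal sketch. Everything in it is an assumption made for illustration: SAMPLE_STOPWORDS is a tiny hypothetical stand-in for stopwords.words('portuguese'), simple_tokenize is a rough regex substitute for word_tokenize (the two will not always split text the same way), and io.StringIO plays the role of arquivo_sujo.csv.

```python
import io
import re
from string import punctuation

# Hypothetical stand-in for stopwords.words('portuguese'); the real NLTK list is much longer.
SAMPLE_STOPWORDS = {"e", "de", "a", "o", "que", "para"}
IGNORED = SAMPLE_STOPWORDS | set(punctuation)


def simple_tokenize(line):
    """Rough substitute for word_tokenize: runs of word characters or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", line.lower())


def clean_lines(lines, ignored=IGNORED):
    """Yield one cleaned string per input line, so the line structure is preserved."""
    for line in lines:
        tokens = simple_tokenize(line)
        yield " ".join(token for token in tokens if token not in ignored)


# In-memory text standing in for the input file; any iterable of lines works.
source = io.StringIO("Desenvolver e trabalhar.\nO gato de casa, que dorme.\n")
cleaned = list(clean_lines(source))
print(cleaned)
```

Because clean_lines accepts any iterable of lines, an open file object can be passed in directly, and each yielded string can be written to the output file followed by '\n'.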