Count words in CSV file


I'm trying to read a CSV file and build a list of all the words in it, along with how many times each one appears. The file was originally a PDF, but I thought it would be simpler to read as CSV. I'm using Google Colab. I started from an answer I found on the English Stack Overflow and made a small adaptation for Colab, since I need to upload the file.

import csv
from google.colab import files
from collections import Counter
from collections import defaultdict

words= []
arquivo = files.upload()

with open(str(arquivo), 'rt') as csvfile:  # here I wrapped the arquivo variable in str()
    reader = csv.reader(csvfile)
    next(reader)
    for col in reader:
         csv_words = col[0].split(" ")
         for i in csv_words:
              words.append(i)

Running this gives me a "File name too long" error (OSError, errno 36), and I don't know how to continue from here. If anyone can point me in the right direction, I'd be grateful!

Have a good weekend, everyone.

  • Can you provide an example file?

  • @Lucas, the link to the test file is https://drive.google.com/file/d/1tJ0Ri1exQwG15zce7orOIBVhlWRdtOI8/view?usp=sharing

1 answer

I did this in PyCharm, but it should be easy to run in Colab.
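
On the Colab side, the error in the question most likely comes from `str(arquivo)`: as far as I know, `files.upload()` returns a dictionary of `{file name: file content as bytes}`, so converting it to a string and passing that to `open()` uses the whole file content as the file name. Here is a minimal sketch of reading the upload straight from memory instead. `count_words_from_upload` is a name I made up for illustration, and the stand-in dictionary only imitates what Colab would return.

```python
import csv
import io
from collections import Counter


def count_words_from_upload(uploaded):
    """Count the words in the first column of every uploaded CSV.

    `uploaded` is assumed to have the shape returned by
    google.colab.files.upload(): {file name: file content as bytes}.
    """
    counts = Counter()
    for filename, content in uploaded.items():
        # Decode the raw bytes and parse them in memory; open() is not needed.
        text = content.decode('utf-8')
        reader = csv.reader(io.StringIO(text, newline=''))
        for row in reader:
            if row:  # csv.reader yields blank lines as empty lists
                counts.update(row[0].split())
    return counts


# In Colab (assumption: only runs inside a Colab notebook):
#     from google.colab import files
#     uploaded = files.upload()
# Stand-in with the same shape, so the sketch runs anywhere:
uploaded = {'teste_arquivo1.csv': 'ola mundo\nola colab\n'.encode('utf-8')}
print(count_words_from_upload(uploaded))
```

If the file has a header row, as the `next(reader)` call in the question suggests, the same call can be added right after the reader is created.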

Note: Python is case sensitive, so a lowercase 'a' is different from an uppercase 'A'. You may want to apply upper() or, better, lower() to each word before counting, and also strip the accents. I left the library (unicodedata) and an example commented out at the end of the code below.

import csv

words = dict()
arquivo = 'teste_arquivo1.csv'

# newline='' is what the csv module recommends when opening a file for csv.reader
with open(arquivo, 'r', encoding='utf-8', newline='') as csvfile:
    reader = csv.reader(csvfile)
    for row in reader:
        # Skip blank lines, which csv.reader yields as empty lists
        if not row:
            continue

        # Split the first column on whitespace
        for palavra in row[0].split():

            # If the word is already in the dictionary, increment its count
            if palavra in words:
                words[palavra] += 1
            else:
                words[palavra] = 1

# Print the dictionary
print(words)

print()

# Loop over the dictionary, one word per line
for chave, valor in words.items():
    print('Word = ' + str(chave) + ((20 - len(str(chave))) * ' ') + ' - Count = ' + str(valor))

'''
from unicodedata import normalize

def remover_acentos(txt, codif='utf-8'):
    palavra =  normalize('NFKD', txt).encode('ASCII', 'ignore').decode('ASCII')
    
    # Lowercase
    palavra = palavra.lower()
    
    return palavra
    
print(remover_acentos('gráfico'))   
'''
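
To show how the commented-out helper could be wired into the counting, here is a small sketch that normalizes each word before counting it. It swaps the manual dictionary for `collections.Counter`, which is a substitution on my part and not what the code above does. `remover_acentos` mirrors the helper above, while `count_normalized` and the sample lines are made up for illustration.

```python
from collections import Counter
from unicodedata import normalize


def remover_acentos(txt):
    """Strip accents and lowercase, mirroring the commented-out helper."""
    # NFKD splits each accented letter into a base letter plus a combining
    # mark; encoding to ASCII with 'ignore' then drops the combining marks.
    sem_acentos = normalize('NFKD', txt).encode('ASCII', 'ignore').decode('ASCII')
    return sem_acentos.lower()


def count_normalized(lines):
    """Count words across an iterable of text lines after normalizing them."""
    counts = Counter()
    for line in lines:
        counts.update(remover_acentos(word) for word in line.split())
    return counts


# Hypothetical sample data; in practice each line would be row[0] from the CSV.
counts = count_normalized(['Gráfico grafico', 'GRÁFICO tabela'])
for word, quantity in counts.most_common():
    print(f'{word:<20} {quantity}')
```

With this, 'Gráfico', 'grafico' and 'GRÁFICO' all end up under the same key. Note that the ASCII step also discards any character that has no ASCII equivalent, so it is only a reasonable fit for text in Latin scripts.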
