How to add values of a csv using Python?

Asked

Viewed 1,373 times

2

I have a csv file similar to this, with all the information of all the municipalities of Brazil (I shortened the csv to not get too extensive):

ESTADO,MUNICIPIO,HABITANTES,AREA
AC,ACRELÂNDIA,12538,1807.92
AC,ASSIS BRASIL,6072,4974.18
AC,BRASILÉIA,21398,3916.5
AC,BUJARI,8471,3034.87
AL,BATALHA,17076,320.92
AL,BELÉM,4551,48.63
AM,BARCELOS,25718,122476.12
AM,BARREIRINHA,27355,5750.57
AM,BENJAMIN CONSTANT,33411,8793.42

I am trying to add up only the number of inhabitants of the northern region, in this case, AC and AM. For this I used the code below (in Python 3.6.5):

import csv

populacao = 0
arquivo = open('brasil.csv', encoding='utf8')
for registro in csv.reader(arquivo):
    habitantes = registro[2]
    estado = registro[0]
    if habitantes != 'habitantes':
        if estado != 'estado':
            regiao_norte = ['AC', 'AM']
            for estado in regiao_norte:
                populacao += int(habitantes)
print(populacao)

I get as sum: 381511598. But the sum is clearly incorrect. I thought using the list would act as a selector for the states I wanted to add. I can’t figure out what I’m missing. How can I make that sum correctly?

1 answer

7


Your mistake, at least the one that stands out most, is that you are going through every state in the North in every record, which is two, and adding up the number of inhabitants of that line twice. And this happens in all lines (north and others) because you are not filtering.

The module csv is unnecessary in most of the cases I see, this is one of them, you can just do:

regiao_norte = {'AC', 'AM'}
populacao = 0
with open('brasil.csv') as f:
    f.readline() # ignorar o nome das colunas, so para evitar fazer operacoes desnecessarias em baixo sobre esta linha 
    for l in f: # percorrer cada linha do ficheiro
        vals = l.replace('\n', '').split(',') # tirar a quebra de linha e separar por virgula
        if(vals[0] in regiao_norte): 
            populacao += int(vals[2])
print(populacao) # 134963 para o exemplo colocado

With the module csv:

import csv

regiao_norte = {'AC', 'AM'}
with open('brasil.csv') as f:
    populacao = sum(int(vals[2]) for vals in csv.reader(f) if vals[0] in regiao_norte)
print(populacao) # 134963 para o exemplo colocado

For more serious things you also have a massively used module, pandas, for this case I think it’s not worth it, but here it is in case you want to go 'further' with this dataset:

import pandas as pd

df = pd.read_csv('brasil.csv')
df_norte = df.loc[(df['ESTADO'] == 'AC') | (df['ESTADO'] == 'AM')] # linhas onde o estado for 'AM' ou 'AC'
populacao = df_norte['HABITANTES'].sum() # 134963 para o exemplo colocado
  • 1

    Thanks for the tips, @Miguel , worked out.

  • You’re welcome @Alineat, good study

  • 2

    In Pandas you could use the operator in? Something like df.loc[df['ESTADO'] in ['AC', 'AM']]

  • 1

    Hello @Andersoncarloswoss, dsculpa but just saw the question now. Unfortunately not, ValueError: The truth value of a Series is ambiguous, https://repl.it/repls/BleakIllegalArchitects. https://stackoverflow.com/questions/36921951/truth-value-of-a-series-is-ambiguous-use-a-empty-a-bool-a-item-a-any-o

Browser other questions tagged

You are not signed in. Login or sign up in order to post.