Reading multiple datasets

1

I am trying to read Anatel's datasets, but they are split into one file per state. Is there a way to read all the files in the folder at once? So far I have been reading them file by file and then joining them all into one.

# Importing the 2016 datasets
import pandas as pd

df_ac2016 = pd.read_csv('dataset/Solicitações Registradas na Anatel (2016-AC).csv', sep=';', encoding='latin-1')
df_al2016 = pd.read_csv('dataset/Solicitações Registradas na Anatel (2016-AL).csv', sep=';', encoding='latin-1')
df_am2016 = pd.read_csv('dataset/Solicitações Registradas na Anatel (2016-AM).csv', sep=';', encoding='latin-1')
df_ap2016 = pd.read_csv('dataset/Solicitações Registradas na Anatel (2016-AP).csv', sep=';', encoding='latin-1')
df_ba2016 = pd.read_csv('dataset/Solicitações Registradas na Anatel (2016-BA).csv', sep=';', encoding='latin-1')
df_ce2016 = pd.read_csv('dataset/Solicitações Registradas na Anatel (2016-CE).csv', sep=';', encoding='latin-1')
df_df2016 = pd.read_csv('dataset/Solicitações Registradas na Anatel (2016-DF).csv', sep=';', encoding='latin-1')
df_es2016 = pd.read_csv('dataset/Solicitações Registradas na Anatel (2016-ES).csv', sep=';', encoding='latin-1')

2 answers

1


This can be done with the glob module from the standard library:

import glob
import pandas as pd

arquivos = glob.glob('dataset/*.csv')
# 'arquivos' is now a list with the paths of all .csv files in the 'dataset' folder
array_df = []

for x in arquivos:
    temp_df = pd.read_csv(x, sep=';', encoding='latin-1')
    array_df.append(temp_df)

After that, you can join them in whatever way you want.

Edit

To concatenate them, you can do this:

df = pd.concat(array_df, ignore_index=True)
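
Putting the glob step and the concatenation together, a minimal sketch could look like the following. The function name read_csv_folder and its parameters are hypothetical, chosen only for illustration; extra keyword arguments are simply forwarded to pd.read_csv.

```python
import glob
import os

import pandas as pd


def read_csv_folder(folder, pattern="*.csv", **read_csv_kwargs):
    """Read every file in `folder` matching `pattern` into one DataFrame.

    Hypothetical helper: keyword arguments such as sep or encoding
    are passed unchanged to pd.read_csv.
    """
    # sorted() makes the row order reproducible across runs
    paths = sorted(glob.glob(os.path.join(folder, pattern)))
    if not paths:
        # pd.concat raises a less helpful error on an empty list
        raise FileNotFoundError(f"no files matching {pattern!r} in {folder!r}")
    frames = [pd.read_csv(path, **read_csv_kwargs) for path in paths]
    # ignore_index=True renumbers the rows from 0 in the combined frame
    return pd.concat(frames, ignore_index=True)
```

For the files in the question, the call would be something like read_csv_folder('dataset', sep=';', encoding='latin-1').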

  • I got it working here, but now I need to see whether I can convert it to a DataFrame. It returned a list, and when I try the conversion it only converts the indices.

  • @Rafaelm. I added an option to concatenate the data. See if it helps.

  • I used it here and it worked. I was trying to force the conversion before concatenating.

1

Currently, pd.read_csv does not support this. However, Dask, a pandas-based library, can do this kind of processing: you can read the files with Dask and convert the result to pandas if you want.

A solution that does not solve your exact problem, but is more general, would be the following:

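import os
import pandas as pd

diretorio = "/diretorio/"
# os.listdir returns bare file names, so each one is joined with the directory to get a usable path
arquivos = [os.path.join(diretorio, f) for f in os.listdir(diretorio) if f.endswith(".csv")]
# read_csv is called with its default arguments here
df = pd.concat(map(pd.read_csv, arquivos))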
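
Because map(pd.read_csv, ...) calls read_csv with its defaults, the sep and encoding used in the question are not applied. One possible way to pass them is functools.partial, sketched below. The helper name concat_csvs is hypothetical.

```python
from functools import partial

import pandas as pd


def concat_csvs(paths, **read_csv_kwargs):
    """Read each path with the same read_csv options and concatenate the results.

    Hypothetical helper: `paths` is any iterable of file paths.
    """
    # partial() fixes the keyword arguments, so map() only has to supply the path
    reader = partial(pd.read_csv, **read_csv_kwargs)
    # ignore_index=True renumbers the rows from 0 in the combined frame
    return pd.concat(map(reader, paths), ignore_index=True)
```

With the list built above, this could be called as concat_csvs(arquivos, sep=';', encoding='latin-1').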
