Processing Complexity in a Pandas Dataframe

Question

Asked 5 years, 11 months ago

I need to solve a data merging problem in Python. I have three layers of folders that I need to enter, find the files in, and merge into a single dataframe. The layers are year, month and day. The files containing the data are .txt files, and I'm running the code in Jupyter Notebook. I have already written the algorithm that puts them together, and I will present it in parts.

First I get the year folders in the directory:

import glob
import os

import pandas as pd

# Directory
os.chdir('diretorio')

# First layer: year
lista_dir1 = sorted(glob.glob('20*'))
lista_dir1

After creating the list of year folders, I loop over each element of the list and map the month folders of every year:

contador = 0

for t in range(len(lista_dir1)):
    prim_cam = "diretorio" + '/' + lista_dir1[t] 

    os.chdir(prim_cam)

    lista_dir2 = [f for f in glob.glob('*')]
    lista_dir2 = list(map(int, lista_dir2))
    lista_dir2.sort()
    lista_dir2 = list(map(str, lista_dir2))

After entering the year, mapping the month folders and creating a second list with those months, I enter each month and look for the daily data files, creating a third list of days:

    for p in range(len(lista_dir2)):

        segun_cam = prim_cam + '/' + lista_dir2[p]
        print(segun_cam)

        os.chdir(segun_cam)

        lista_dir3 = [f for f in glob.glob('*')]
        lista_dir3 = list(map(int, lista_dir3))
        lista_dir3.sort()
        lista_dir3 = list(map(str, lista_dir3))

Finally, I open each file, read the data and merge it into a dataframe:

        for y in range(len(lista_dir3)):
            dados = open(lista_dir3[y])
            yourList = dados.readlines()


            if((t == 0) and (p == 0) and (y == 0)):

                dados_compl = pd.DataFrame(columns = list(yourList[0].split(',')))

                for l in range(1,len(yourList)):
                    dados_compl.loc[l + contador*1440] = list(yourList[l].split(',')) 
                contador += 1

            else:

                for l in range(1,len(yourList)):
                    dados_compl.loc[l + contador*1440] = list(yourList[l].split(',')) 
                contador += 1

I also use a counter called contador to index my dataframe.

Well, now come the questions.

1. I am using Jupyter Notebook and the process has been time-consuming. Would there be any difference in processing time if I ran the program in another Python interpreter?

2. The columns of my dataframe are of type object:

Date                     object
Time                     object
Global_active_power      object
Global_reactive_power    object
Voltage                  object
Global_intensity         object
Sub_metering_1           object
Sub_metering_2           object
Sub_metering_3\n         object
dtype: object

Would it be advisable to convert the numbers to float? Would this make my processing faster?

3. I used these functions

dados = open(lista_dir3[y])
yourList = dados.readlines()

to read the data. Is there a more efficient alternative?

4. What is the object data type? I program a lot in R and I don't remember seeing that kind of data.

Sincerely yours!

1 answer

by Humberto • 26 points · Answer 1 · 2019-08-09T21:08:43+00:00

Summing up my understanding:

You have files stored somewhere under base_dir/year/month/day/some_file.txt (or would it be CSV?) and need to merge the dataframes derived from these txt files faster.

I don’t know exactly if it’ll get any faster but I can give you some ideas.

List only the files within the date range you need (I don't know if you are already doing this)
Use the pandas.concat function to join dataframes row-wise instead of assigning rows with loc (I don't know if it is faster)

See reference here: https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf
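To make the second idea concrete, here is a minimal sketch comparing the two approaches on in-memory data. The two small dataframes and their values are hypothetical stand-ins for two daily files; only the column names come from the question. Both approaches should give the same rows, but the concat version joins the pieces in a single call instead of enlarging the dataframe one row at a time.

```python
import pandas as pd

# Hypothetical sample data standing in for two daily files
day1 = pd.DataFrame({"Voltage": [234.8, 233.6], "Global_intensity": [18.4, 23.0]})
day2 = pd.DataFrame({"Voltage": [233.3, 235.1], "Global_intensity": [23.0, 15.8]})

# Row-by-row approach: every .loc assignment enlarges the dataframe
slow = pd.DataFrame(columns=day1.columns)
for frame in (day1, day2):
    for _, row in frame.iterrows():
        slow.loc[len(slow)] = row.tolist()

# Concat approach: collect the pieces and join them once
fast = pd.concat([day1, day2], ignore_index=True)
```

Note that the dataframe built row by row starts without column types, so its columns may end up as object, while the concatenated one keeps the float64 columns of its inputs.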

Below is a suggestion of code:

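import pandas as pd

base_dir = '<base directory of the files>'

# year_init, year_end, month_init, month_end, day_init, day_end and filename
# are placeholders that you need to define for your data
frames = []

for year in range(year_init, year_end + 1):
  for month in range(month_init, month_end + 1):
    for day in range(day_init, day_end + 1):
      # is there a way to know the file name inside the directory?
      full_filename = f'{base_dir}/{year}/{month}/{day}/{filename}'
      frames.append(pd.read_csv(full_filename))

# pd.concat takes a list of dataframes; join them once at the end
df = pd.concat(frames)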
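If the file names are not known in advance, one option is to let a glob pattern find them. The sketch below is only an assumption about the layout: it supposes year folders starting with "20", numeric month and day folder names, and comma-separated .txt files with a header row. The function name load_all and the default pattern are made up for illustration.

```python
from pathlib import Path

import pandas as pd


def load_all(base_dir, pattern="20*/*/*/*.txt"):
    """Read every file matching `pattern` under base_dir into one dataframe."""
    files = Path(base_dir).glob(pattern)

    # Sort numerically by (year, month, day) so that month 2 comes before
    # month 10; this assumes the folder names are plain integers.
    def date_key(path):
        return [int(part) for part in path.relative_to(base_dir).parts[:-1]]

    frames = [pd.read_csv(f) for f in sorted(files, key=date_key)]

    # pd.concat raises ValueError if no files matched the pattern
    return pd.concat(frames, ignore_index=True)
```

Calling load_all('<base directory of the files>') would then return the merged dataframe. Because pd.read_csv infers column types while parsing, purely numeric columns typically arrive as float64 or int64 rather than object, which should also help with the processing that comes afterwards.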