I need to combine several data files in Python. The files sit under three layers of folders (year, month and day), and I need to walk through them, find each file and merge everything into a single dataframe. The data files are .txt files, and I am running the code in Jupyter Notebook. I have already written the code that puts them together, and I will present it in parts.
First I list the year folders in the base directory:
import glob
import os

import pandas as pd

# Base directory
os.chdir('diretorio')
# First layer: year folders
lista_dir1 = sorted(glob.glob('20*'))
lista_dir1
After building the list of year folders, I loop over each element and list the month folders inside every year:
contador = 0
for t in range(len(lista_dir1)):
    prim_cam = "diretorio" + '/' + lista_dir1[t]
    os.chdir(prim_cam)
    lista_dir2 = [f for f in glob.glob('*')]
    lista_dir2 = list(map(int, lista_dir2))
    lista_dir2.sort()
    lista_dir2 = list(map(str, lista_dir2))
After entering a year, listing its month folders and building the second list, I enter each month and look for the day files, building a third list:
    for p in range(len(lista_dir2)):
        segun_cam = prim_cam + '/' + lista_dir2[p]
        print(segun_cam)
        os.chdir(segun_cam)
        lista_dir3 = [f for f in glob.glob('*')]
        lista_dir3 = list(map(int, lista_dir3))
        lista_dir3.sort()
        lista_dir3 = list(map(str, lista_dir3))
Finally, I open each file, read its data and append it to the dataframe:
        for y in range(len(lista_dir3)):
            dados = open(lista_dir3[y])
            yourList = dados.readlines()
            if((t == 0) and (p == 0) and (y == 0)):
                # Very first file: use its header line as the column names
                dados_compl = pd.DataFrame(columns = list(yourList[0].split(',')))
                for l in range(1,len(yourList)):
                    dados_compl.loc[l + contador*1440] = list(yourList[l].split(','))
                contador += 1
            else:
                for l in range(1,len(yourList)):
                    dados_compl.loc[l + contador*1440] = list(yourList[l].split(','))
                contador += 1
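Taken together, the three loops visit every day file in date order. A minimal sketch of that same traversal without os.chdir follows. It is not the code from this post: the function name list_day_files is made up for illustration, and it assumes that month folder names and day file stems are plain integers (for example 2007/1/15.txt).

```python
from pathlib import Path


def list_day_files(base_dir):
    """Return day files under base_dir/<year>/<month>/<day>, in date order.

    Assumes year folders start with '20' and that month folder names and
    day file stems are plain integers (hypothetical layout).
    """
    base = Path(base_dir)
    files = []
    # Year names such as '2006' and '2007' already sort correctly as text
    for year in sorted(p for p in base.glob('20*') if p.is_dir()):
        # Sort months numerically so that '2' comes before '12'
        months = sorted((p for p in year.iterdir() if p.is_dir()),
                        key=lambda p: int(p.name))
        for month in months:
            # Sort days numerically by the file name without its extension
            days = sorted((p for p in month.iterdir() if p.is_file()),
                          key=lambda p: int(p.stem))
            files.extend(days)
    return files
```

Because the function returns full paths, each file can be opened directly and the working directory never has to change.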
I also created a counter called contador to index my dataframe.
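The arithmetic behind that index can be checked in isolation. The sketch below mirrors the expression l + contador*1440 from the loop; the helper name row_label is hypothetical, and the reading of 1440 as 24 * 60 (one row per minute of a day) is an assumption, since the post does not state the sampling rate.

```python
ROWS_PER_FILE = 1440  # assumed: 24 * 60, one row per minute of a day


def row_label(line_number, file_counter, rows_per_file=ROWS_PER_FILE):
    """Mirror the expression `l + contador*1440` used as the row label."""
    return line_number + file_counter * rows_per_file


# First data line of the first file and of the second file
print(row_label(1, 0))     # 1
print(row_label(1, 1))     # 1441
# Last data line of a 1440-row file stays below the next file's first label
print(row_label(1440, 0))  # 1440
# A 1441st data line would get the same label as the next file's first line
print(row_label(1441, 0) == row_label(1, 1))  # True
```

The last line shows the one caveat of this scheme: if any file holds more than 1440 data lines, labels repeat and assigning through .loc overwrites the earlier row.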
Now for the questions.
1. I am using Jupyter Notebook and the process is very slow. Would there be any difference in processing time if I ran the program in another Python interpreter?
2. The columns of my dataframe are all of type object:
Date object
Time object
Global_active_power object
Global_reactive_power object
Voltage object
Global_intensity object
Sub_metering_1 object
Sub_metering_2 object
Sub_metering_3\n object
dtype: object
Would it be advisable to convert the numbers to float? Would that make my processing faster?
3. I used these calls to read the data:
dados = open(lista_dir3[y])
yourList = dados.readlines()
Is there a more efficient alternative?
4. What is the object data type? I program a lot in R and I don't remember seeing that kind of data there.
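To make questions 2 to 4 concrete, here is a small sketch that reproduces the situation on two sample lines. The sample values and the shortened column set are illustrative only, not taken from the real files. Splitting lines by hand keeps every field as a Python string, which is why pandas reports a generic, non-numeric dtype (object in the output above, the catch-all dtype for columns holding arbitrary Python objects) and why the last column name keeps its trailing \n. For comparison, the same text is also passed through pd.read_csv, which infers numeric columns while reading.

```python
import io

import pandas as pd

# Two illustrative lines in the same comma-separated shape as the files
raw = ("Date,Time,Voltage\n"
       "16/12/2006,17:24:00,234.84\n"
       "16/12/2006,17:25:00,233.63\n")

# Manual route: readlines() + split(',') keeps every field as a Python str
lines = io.StringIO(raw).readlines()
manual = pd.DataFrame([ln.split(',') for ln in lines[1:]],
                      columns=lines[0].split(','))
manual_columns = list(manual.columns)
print(manual_columns)  # the last name still ends with a newline character
print(manual.dtypes)   # no numeric dtype here, only text columns

# Strip the newline from the names, then convert one column explicitly
manual.columns = [c.strip() for c in manual.columns]
manual['Voltage'] = pd.to_numeric(manual['Voltage'].str.strip())
print(manual.dtypes)

# pandas' own parser infers numeric columns while reading
parsed = pd.read_csv(io.StringIO(raw))
print(parsed.dtypes)
```

With real files, a path can be given to pd.read_csv in place of the io.StringIO object. Whether this is faster on the actual data is something to measure rather than assume.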
Sincerely yours!