How to unify by Python several repeated lines of an Excel file?

Asked

Viewed 49 times

1

I have this spreadsheet in Excel, which is generated in the system by the pandas dataframe

inserir a descrição da imagem aqui

I can even generate the file, the problem is that it generates this way above. I need that, for example, in the column nome and sobrenome, it unifies the same names on a single line and therefore the data will be on the same line.

The end result has to be like this:

inserir a descrição da imagem aqui

My code generates the spreadsheet but is not unifying the lines. How can I do this?

That’s the part of the code I can’t go through:

import pandas as pd

resultado = []
mydict = {}
for row in dados:
       if mydict  != {}:
            resultado.append(mydict)
       mydict = {}
       mydict['nome'] = row['nome']
       mydict['sobrenome'] = row['sobrenome']

If I take that if of the loop for and I put only one specific name, I get it to print all on one line, but if it goes back to the for, it prints all the nomes and sobrenomes, only with the information on separate lines.

1 answer

0

First you create a dataframe only with the repeating data, that is to say, nome and sobrenome (which will be used as chave primária), and remove duplicates like this:

df = dados[['nome','sobrenome']].drop_duplicates()
display(df)

Upshot:

name surname
John Peter
Paul Joseph

This table will serve as the main table to join the other information.

Now for each age column, you need to create a table with the chave primária of the main table and the age values of each name, removing the lines containing null values with the method .dropna(axis=0), thus:

for i in range(2,5):  # 2,5 são os indíces das colunas com as idades
  temp = dados.iloc[:,[0,i]].dropna(axis=0)  # Recebe a coluna nome e a coluna de idade
  df = pd.merge(df, temp)

On the line

df = pd.merge(df, temp)

we are merging our main table with the temporary table that contains the values of the age column.

The final result returns its table without the null and duplicate values:

name surname Idade_agosto_2019 Idade_agosto_2020 Idade_agosto_2021
John Peter 30.0 31.0 32.0
Paul Joseph 21.0 22.0 23.0
  • Good morning Felipe, it worked. Thank you very much. A doubt, As I withdraw the decimal places in the result of the column age?

  • Good morning, Fabricio, add .astype('Int64') after the method .dropna(axis=0) on the variable line temp, getting: temp = dados.iloc[:,[0,i]].dropna(axis=0).astype('Int64').

  • Show. Thank you.

  • Felipe. Another question. In this code below, it runs cool if I have few lines. What if I have many lines? How do I change this part in the for loop? for i in range(2,5): #2,5 are the columns' indices with ages temp = data.iloc[:,[0,i]]. dropna(Axis=0) # Takes the column name and the column age df = pd.merge(df, temp)

  • Fabricio, you will only change the value of range(2,5) If you have other old columns or change the column layout, a table with more rows should work normally. A question, why is the name separate from the surname? Because, if you have a repeated name, I believe it will result in some error.

  • Hi Felipe, good morning. Thanks for the help. I get it. Vdd, the values within the range are related to the columns. So in case you have more columns of age, do I change the second value to the maximum Qtde of columns? Is that it? There is no reason for the name and surname to be separated, I put so for study. Only simulation even with several possible situations.

  • Exactly Fabricio, when adding another column, change the second element of range for the total of columns. I suggest you use only one Columa for the name to avoid problems with duplicated first names, but you may need to make some modifications to your code, I am available to help you, have a great day.

  • Gratitude Felipe, good weekend.

Show 3 more comments

Browser other questions tagged

You are not signed in. Login or sign up in order to post.