Fastest way to iterate over lines in python, pandas

Asked

Viewed 126 times

3

I’m having a certain performance difficulty in the code because it’s taking too long to run.

I have the following Dataframe. (Example because what I’m wearing is much bigger)

    orders = {'Código': [600, 600, 601, 602],
              'Num. Pedido': [1000, 1000, 1002, 1003],
              'Data Pedido': ['10/01', '10/01', '08/09', 12/01],
              'Sabor': ['Calabresa', 'Mussarela', 'Pepperoni', 'Portuguesa'],
              'Quantidade': [1, 1, 1, 1],
              'Metade': [1, 2, 1, 1],
              'Meia': ['Sim', 'Sim', 'Não', 'Não']
              'Preço': [40.0, 32.0, 45.0, 35.0]} 

    df = pd.DataFrame(data=orders)

What happens is, the order 1000 is practically equal to the order 1000, what changes is the Half Taste and the Price. What I want to do is put the Flavor of a half next to the second half for example, Calabresa+Mozzarella. And divide the price by 2 because the price is whole. I did it as follows, but it is running for more than 10 hours.


cache_metade = {}
df_final = pd.DataFrame()
lista = []

for index, row in df.iterrows():
    
    if index not in cache_metade:
        
        df_metade = df.loc[meioameio['Num. Pedido'] == row['Num. Pedido']]
        df_metade = df_metade.loc[df_metade['Data Pedido'] == row['Data Pedido']]
        df_metade = df_metade.loc[df_metade['Código'] == row['Código']]
        df_metade = df_metade.loc[df_metade['Metade'] != row['Metade']]
        df_metade = df_metade[~df_metade.index.isin(cache_metade)]
        
        if len(df_metade.index) > 0:
            
            metade_index = df_metade.index[0]
            metade = df_metade.iloc[0]
                
            if row['Sabores'] and metade['Sabores'] is not None:
                
                metade['Sabor'] = metade['Sabor'] + "+" + row['Sabor']
                metade['Preço'] = (metade['Preço'] / 2) + (row['Preço'] / 2)
                lista.append(row['Código'])
                df_final = df_final.append(metade)
                cache_metade[index] = True
                cache_metade[metade_index] = True

drop = df[df['Código'].isin(lista)].index
df_pedidos = df.drop(drop)
df_pedidos = pd.concat([df_pedidos, df_final])

Is there another way to do this? A way to be more efficient. Remember, the database I have is much larger than the one I exemplified.

  • 1

    Is it a real system or is it an exercise?

  • It’s a project I’m creating

  • 1

    If it is commercial, do not use the dataframe in the service, it is better to take advantage of data analysis and the way records are fragmented is ineffective. Already know Sqlite? python has a module dedicated to this DB. See here

  • And what would the column "Half"?

  • You can put a small example df of what the final result would look like

  • @Miguel Orders = {'Code': [600, 601, 602], 'Num. Order': [1000, 1002, 1003], 'Date Request': ['10/01', '08/09', '12/01'], 'Taste': ['Calabresa+Mussrela', 'Pepperoni', 'Portuguese'], 'Quantity': [1, 1], 'Half': ['1+2', '2', '1', '1'], 'Half': ['Yes', 'No', 'No'], 'Price': [36.0, 45.0, 35.0]}

Show 1 more comment

1 answer

3


Having the following original dataframe:

   Código Data Pedido Meia Metade  Num. Pedido  Preço Quantidade       Sabor
0     600       10/01  Sim      1         1000   40.0          1   Calabresa
1     600       10/01  Sim      2         1000   32.0          1   Mussarela
2     601       08/09  Não      1         1002   45.0          1   Pepperoni
3     602       12/01  Não      1         1003   35.0          1  Portuguesa

Then all this code can be reduced to:

df['Quantidade'] = df['Quantidade'].astype(str) # cast para str para podermos concatenar
df['Metade'] = df['Metade'].astype(str) # cast para str para podermos concatenar
df_final = df.groupby(['Código', 'Data Pedido', 'Num. Pedido', 'Meia'], as_index=False).agg({'Preço': 'mean', 'Metade': '+'.join, 'Sabor': '+'.join, 'Quantidade': '+'.join})

df_final:

   Código Data Pedido  Num. Pedido Meia Quantidade                Sabor Metade  Preço
0     600       10/01         1000  Sim        1+1  Calabresa+Mussarela    1+2   36.0
1     601       08/09         1002  Não          1            Pepperoni      1   45.0
2     602       12/01         1003  Não          1           Portuguesa      1   35.0

Let’s group by ['Código', 'Data Pedido', 'Num. Pedido', 'Meia'] and perform the desired operations for each of the other columns in the function Agg.

From what I understand, this is what you want.

But as stated in comments if this data is coming from a database it is better to work (do the query) there.

Browser other questions tagged

You are not signed in. Login or sign up in order to post.