Replace use of dataframe with pandas . apply

Asked

Viewed 280 times

0

I have a function that iterates a spreadsheet, which makes some controls. I looked over and saw about . apply, but I couldn’t implode. I would like to know if you could in that case, replace the use I use and for performance purposes, change to . apply.

In this case, I have a dictionary that he consults for my treated conditions. And the function that iterates making a loop to make this happen. The function depends on the position I pass in the dictionary and that comes from the spreadsheet to know if it displays a photo field or not.

    for i, row in df.iterrows():
        if row['cargo'] in cargos_to_display_photo:
            df.loc[i, 'display_foto'] = True
        else:
            df.loc[i, 'display_foto'] = False
    return df 

the function that "prepares" the data frame is:

def prepare_dataframe(df_google):
    df_google = set_display_photo(df_google)
    return df_google

the dictionary:

cargos_to_display_photo = [
    'Trainee | Assessoria de Investimentos',
    'Sócio Fundador',
    'Superintendente',
    'Assessor de Investimentos',
    'Sócia | Assessor de Investimentos',
    'Líder Corporate',
    'Trainee | Corporate',
]

the column I consult:

columns_dict = {
    'SEXO' : 'sexo',
    'CÓDIGO' : 'codigo',
    'Nome Completo' : 'nome',
    'Cidade' : 'unidade',
    'Cargo' : 'cargo',
}

The columns that do not query or that were not cited in the excerpt I discard by the code itself, giving a drop in them with the necessary negotiations so that do not need to iterate.

the ways I researched, I couldn’t implode any. I couldn’t get apply to populate the spreadsheet. It would be possible to do this with these functions?

  • The code formatting of your question went bad, below the first df Return you should put the ```more to the left side. Another issue, you can make the spreadsheet available for testing?

  • ajustei, its link: https://docs.google.com/spreadsheets/d/1VFETvsNvA-NCFPlplCtZ1qPw3HeaCV7Te2x8kopf7ZE/edit?usp=sharing

  • @lmonferrari I took some fields from the spreadsheet because it has some confidential data. But I believe you can test.

  • Sure, no problem at all!

2 answers

0

I looked over and saw about . apply, but I couldn’t implement it. I would like to know if you could in that case, replace the use I use and for performance purposes, change to . apply.

Importing the file:

import pandas as pd

df = pd.read_excel('./Cadastro Colaboradores BT.xlsx')

List of posts:

cargos_to_display_photo = [ 'Trainee | Assessoria de Investimentos', 
                           'Sócio Fundador', 'Superintendente', 
                           'Assessor de Investimentos', 
                           'Sócia | Assessor de Investimentos', 
                           'Líder Corporate', 'Trainee | Corporate', ]

Code with the apply:

df['display_foto'] = df['Cargo'].apply(lambda x: x in cargos_to_display_photo)

I create a column called display_photo, apply it to the Job column of the dataframe and make an anonymous function that checks the contents of the column row by row.

  • Run time with for loop:

    69.4 ms 1.52 ms per loop (Mean Std. dev. of 7 runs, 10 loops each)

  • Run time with apply:

    305 µs 30.8 µs per loop (Mean Std. dev. of 7 runs, 1000 loops each)

  • I couldn’t figure it out yet. What changes in display_photo function?

  • In the code I posted the function set_display_foto is replaced by an anonymous function lambda, this function checks inside your list of cargos_to_display_fotos if in the column cargos there is some compatible value, if it finds the true value, it returns True, if it does not find it returns False. In the command df['display_foto'] = a new column called display_foto exactly as his function did.

  • but this column is created in the spreadsheet? Because I can not change it. It is not of my control, I use it only for reading the data and iteration.

  • No, Wanderson, it’s created in your data frame, just like its function does. This column will appear only if you create a new excel file or have it saved in the current spreadsheet.

  • o df['display_foto'] = df['Cargo']. apply(lambda x: x in cargos_to_display_photo) I call in prepare dataframe, in case?

  • If you want to call inside the prepare_dataframe you could do so: def prepare_dataframe(df_google):
 df_google['display_foto'] = df_google['Cargo'].apply(lambda x: x in cargos_to_display_photo)

  • If you want to return a NEW dataframe: def prepare_dataframe(df_google):
 df_prepared = df_google
 df_prepared['display_foto'] = df_google['Cargo'].apply(lambda x: x in cargos_to_display_photo)
 return df_prepared

  • cool! I just couldn’t test.. I used to: 
 for i, row in df_google.iterrows():
 ...: print(row['nome'], row['cargo'], row['display_foto'])
 in the shell and it returned the spreadsheet. How you tested?

  • gives a look: https://ibb.co/Jm4Dw50 I removed the names of the image but here’s how I do it

Show 4 more comments

0

One way to do this would be by using the function .isin() pandas. It returns a boolean list when it finds some value within the array cargos_to_display_photo

df['display_foto'] = df['Cargo'].isin(cargos_to_display_photo)
df.head()
    SEXO    CÓDIGO  Nome completo   Cidade      Cargo       display_foto
0   m       NaN     Samuel...       Ribeir...   Sócio ...   False
1   m       NaN     Leonar...       Uberlâ...   Assess...   True
2   m       NaN     Andre ...       Franca      Sócio ...   False
3   f       NaN     Rejane...       Uberlâ...   Assess...   False
4   m       NaN     Wevers...       Rio de...   Traine...   True

A good practice to get performance(speed) in pandas runs is to avoid using repetition loops as well as function apply (which is also a type of iteration). Follow the implementation time of my proposal:

220 µs 5.93 µs per loop (Mean Std. dev. of 7 runs, 1000 loops each)

Browser other questions tagged

You are not signed in. Login or sign up in order to post.