How to improve runtime

I have a CSV with enrollment data for a study I'm doing. I generated an id for each enrollment, and I want to check whether any id repeats within the same year.

I have the following code:

import pandas as pd

# Returns True if idx appears more than once in the given year.
def repetido(idx, ano, df):
    cont = 0
    for index, row in df.iterrows():
        if idx == row['Id'] and ano == row['NU_ANO_CENSO']:
            cont += 1
    if cont > 1:
        return True
    else:
        return False

# Prints a list of the ids that repeat, if there are any.
def contIdRepetido():
    df = pd.read_csv('../dados/dados_padronizados_matriculas_januaria_2009_2018_com_id.csv')
    repetidos = []
    for index, row in df.iterrows():
        if repetido(row['Id'], row['NU_ANO_CENSO'], df):
            repetidos.append(row['Id'])
    print(f'Id repetidos: {repetidos}')

But this way it takes too long to run. Does anyone know a more efficient way to do it?

csv can be found here

2 answers

You can use pandas' groupby + value_counts:

pd.set_option('display.max_rows', None)
df.groupby(['NU_ANO_CENSO'])['Id'].value_counts()

The pd.set_option line removes the display row limit, which is useful if you are running this in a Jupyter notebook, for example.

Output:

               Id    count
2009          2337       3
              1736       2
              1898       2
              1963       2
              2217       2
              2304       2
              2343       2
              2845       2
              2880       2
                     ..
2018          3723       1
              3724       1
              3725       1
              3726       1
              3727       1

Done this way, you gain a lot in performance.


Edit

Following Terry’s suggestion, to return only the values greater than 1:

valores = df.groupby(['NU_ANO_CENSO'])['Id'].value_counts()

Pandas returns an object of type Series, as shown:

type(valores)
pandas.core.series.Series

So we can filter it:

valores[valores > 1].index.values

Output:

array([(2009, 2337), (2009, 1736), (2009, 1898), (2009, 1963),
       (2009, 2217), (2009, 2304), (2009, 2343), (2009, 2845),
       (2009, 2880), (2009, 3114), (2009, 3398), (2010, 2337),
       .......................................................
       (2018, 3355), (2018, 3364), (2018, 3398), (2018, 3406),
       (2018, 3407), (2018, 3423), (2018, 3449), (2018, 3451),
       (2018, 3452), (2018, 3453), (2018, 3454), (2018, 3467),
       (2018, 3474), (2018, 3484), (2018, 3485), (2018, 3486),
       (2018, 3520), (2018, 3521), (2018, 3536), (2018, 3556),
       (2018, 3557), (2018, 3585), (2018, 3602), (2018, 3605)], dtype=object)
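On a small synthetic DataFrame, the same groupby + value_counts filter looks like this end to end (the column names follow the question; the data below is made up for illustration):

```python
import pandas as pd

# Toy data: Id 10 appears twice in 2009; Id 20 appears once per year.
df = pd.DataFrame({
    'NU_ANO_CENSO': [2009, 2009, 2009, 2010],
    'Id':           [10,   10,   20,   20],
})

# Count occurrences of each Id within each year...
contagens = df.groupby('NU_ANO_CENSO')['Id'].value_counts()

# ...and keep only the (year, Id) pairs that appear more than once.
repetidos = contagens[contagens > 1].index.values
print(repetidos)  # [(2009, 10)]
```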
  • I did it in a slightly different way to return the values greater than one, but this was the method with the best performance, thank you!

  • Yes, there are several ways to return these values. I just wanted to show that it is possible using pandas functions. I’m glad it helped you! Hugs!!


You call the function repetido for each row of your dataframe, and that function, in turn, looks again at every row of the dataframe.

That is, with 10 rows it makes 100 comparisons; with 1000 rows, 1 million comparisons. Not a good idea at all.

A superficial fix would be to replace the search in either of the two steps with a vectorized lookup instead of for ... iterrows - that makes the search run in internal pandas/numpy code, which can be many times (10 to 100x) faster - but it would not solve the fundamental problem: you would still be making N² comparisons.

Both dataframes and Python lists only support linear search (and therefore linear comparison): every data row has to be examined.

This is usually solved with a different data structure - a Python dictionary, for example. In dictionaries, key lookup takes constant time, no matter how much data the dictionary holds.
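A minimal illustration of this difference (plain Python, no pandas): membership tests on a list scan element by element, while a set resolves them by hashing, in constant time on average:

```python
# Looking up a value in a list scans every element (O(n));
# looking it up in a set uses hashing (O(1) on average).
dados = list(range(100_000))
como_lista = dados        # linear-time membership tests
como_set = set(dados)     # constant-time membership tests

# Both give the same answers, but the set does far less work per lookup.
assert (99_999 in como_lista) == (99_999 in como_set) == True
assert (-1 in como_lista) == (-1 in como_set) == False
```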

Then you can go through your dataframe once, counting the repetitions with a dictionary, instead of calling another function that traverses the entire dataframe again. In this case, since you only want to know whether there was a repetition or not, you can use a set - just to mark the ids already seen. Sets, like dictionaries, also have constant-time access. It looks like this:

      
import pandas as pd

# About the function name - Python uses "snake case";
# decades of research show that, even if it is not more
# readable, it is at least preferred by Python programmers:

def conta_id_repetido(df):
    # ^ the nicest thing about functions is being able to reuse them.
    # It makes no sense to hard-code the data file path inside the
    # function - it is better for it to receive the data to process,
    # or at least the path, as a parameter.

    # "Ah, but the file will always be this one." No! Here we are,
    # me answering your question, and obviously I don't have the
    # file at the path that was hard-coded here.

    repetidos = []
    vistos = set()
    for index, row in df.iterrows():
        chave = row["Id"], row["NU_ANO_CENSO"]
        if chave not in vistos:
            vistos.add(chave)
        else:
            repetidos.append(row["Id"])
    return repetidos

def principal():
    # The program's main function is the entry point and orchestrates
    # the reusable code in the other functions - here it can make sense
    # to read the dataframe from a specific file (and even then the
    # path on disk could still come, for example, from a command-line
    # argument or an environment variable), and here the "print" belongs.
    df = pd.read_csv('../dados/dados_padronizados_matriculas_januaria_2009_2018_com_id.csv')
    repetidos = conta_id_repetido(df)
    print(f'Id repetidos: {repetidos}')

In this code, if your dataframe has 1000 rows, only... 1000 comparisons are made, not 1 million!
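For completeness, pandas can also do this single pass internally: DataFrame.duplicated marks repeated rows over a subset of columns, with no explicit loop. A sketch, using the same column names as the question but made-up data:

```python
import pandas as pd

# Toy data standing in for the real CSV: Id 10 repeats in 2009.
df = pd.DataFrame({
    'Id':           [10, 10, 20, 20],
    'NU_ANO_CENSO': [2009, 2009, 2009, 2010],
})

# duplicated() marks every occurrence after the first of each
# (Id, year) pair; keep=False would mark all occurrences instead.
mascara = df.duplicated(subset=['Id', 'NU_ANO_CENSO'])
repetidos = df.loc[mascara, 'Id'].tolist()
print(f'Id repetidos: {repetidos}')  # Id repetidos: [10]
```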

  • Thanks, friend! It helped me a lot.
