Improve Python apply performance with lambda

Hello.

I am developing some code in Python, but it is taking a long time to run. I wonder if there is a more efficient approach.

Below is the function I use:

from datetime import timedelta

def calcMovelMensalCircuito(nue, data, circuito):
    # Look at the 365-day window ending at `data`.
    days_to_subtract = 365
    dias = timedelta(days=days_to_subtract)
    d = data - dias

    # Sum DEC_EMPRESA over the matching rows of the global `eventos` dataframe.
    return eventos.query("DATA <= @data & DATA > @d & NUE_ORDEM == @nue & CIRCUITO == @circuito")["DEC_EMPRESA"].sum()

Here I use the apply method together with a lambda to create a column with the computed value:

dfCircuito['DEC_MOVEL'] = dfCircuito.apply(lambda x: calcMovelMensalCircuito(x['NUE'], x['DATA'], x['DESCRICAO']), axis=1)
  • What is this eventos? Isn't the delay simply due to running a query for each record of your dataframe?

  • eventos is another dataframe that holds the database data; I use it as the basis for the calculations.

2 answers


Erick, how's it going?

You can replace the query with boolean filtering plus the groupby method.

Run this code first to check that it really produces what you want:

df[(df['DATA'] <= data) & (df['DATA'] > d) & (df['NUE'] == nue) & (df['CIRCUITO'] == circuito)].groupby(['NUE', 'DATA', 'DESCRICAO'])['DEC_EMPRESA'].sum()

If that is what you want, you can then put the values in a column:

df[(df['DATA'] <= data) & (df['DATA'] > d) & (df['NUE'] == nue) & (df['CIRCUITO'] == circuito)].groupby(['NUE', 'DATA', 'DESCRICAO'])['DEC_EMPRESA'].transform('sum')
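A minimal sketch of writing that back, assuming the goal is to fill a DEC_MOVEL column on the filtered rows of df itself (the mask variable is introduced here for readability; nue, data, d and circuito are assumed to be defined as above):

# Build the filter once as a boolean mask.
mask = (df['DATA'] <= data) & (df['DATA'] > d) & (df['NUE'] == nue) & (df['CIRCUITO'] == circuito)

# transform('sum') keeps the original row index, so the result aligns
# with the filtered rows and can be assigned in place.
df.loc[mask, 'DEC_MOVEL'] = df[mask].groupby(['NUE', 'DATA', 'DESCRICAO'])['DEC_EMPRESA'].transform('sum')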

I hope it helped.


apply calls your function once for each row. The lambda there does nothing by itself - it's just syntactic sugar to pass the specific columns on to the other function, calcMovelMensalCircuito. It would be trivial to write this without the lambda, as shown below, but it wouldn't change anything either.
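For illustration, the same call with a named helper instead of the lambda (calc_row is a hypothetical name, not from the question) - it behaves identically:

def calc_row(x):
    # The same column lookups the lambda was doing.
    return calcMovelMensalCircuito(x['NUE'], x['DATA'], x['DESCRICAO'])

dfCircuito['DEC_MOVEL'] = dfCircuito.apply(calc_row, axis=1)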

What happens there, and you don't mention in the question, is that this function calcMovelMensalCircuito runs a query at the end - and what is this query? A database? Some web service? We can't guess - but whatever it is, it's pretty clear that's where your delay is. One would assume that this query is a call that performs I/O, either to a local database or to a remote service - and your program just sits there waiting for the answer.

As you clarified in the comment, however, the query is Pandas itself, done completely in memory - so parallelization strategies involving multi-threading, multiprocessing or asyncio would not help at all.

What happens is that Pandas' query is a "brute force" affair - it walks the dataframe it is called on, row by row, testing the condition each time, and generates a "virtual dataframe", which is basically a marking of the rows of the original dataframe that matter - and on top of that it does the sum. And in this case, it is walking the whole eventos table for each row of dfCircuito - this makes your algorithm quadratic (O(M x N)).

But before we dive in deeper: if it were the case (it doesn't seem to be) that calcMovelMensalCircuito(nue, data, circuito) were called repeatedly with the same arguments (that is, several rows using the same values), just putting a cache on the function would solve it - since for the repeated values, with the cache, the expensive call would not be made. In that case it would be enough to decorate the function:

from functools import lru_cache

@lru_cache()   # memoizes results keyed by (nue, data, circuito)
def calcMovelMensalCircuito(nue, data, circuito):
    ...
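(A caveat: lru_cache requires all the arguments to be hashable - strings and pandas Timestamps are, so if data is a Timestamp the decorator applies directly.)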

Now - since the values of nue, the date range, and the circuit are probably unique to each row, there is nothing very "automatic" to do - it depends on knowing your data set - but one alternative would be to partition the data in the eventos dataframe, so that each query runs only over a much smaller dataframe containing just the rows whose circuit matches the one being searched.

Let me explain: if, for example, eventos has roughly 1 million rows and some 100 distinct circuits, then before doing this apply you programmatically split eventos into 100 dataframes of ~10,000 rows each, stored in a dictionary eventos_dict - the key to each dataframe would, of course, be the circuit. Each query for each row of your original dataframe would then have to scan only the ~10,000 rows of the right circuit, instead of the original 1 million - see the sketch below.
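A minimal sketch of that partitioning, assuming (as in the question) that eventos has a CIRCUITO column, that dfCircuito's DESCRICAO holds the circuit name, and that every circuit in dfCircuito appears in eventos; calcMovelMensalCircuitoParticionado is a hypothetical name:

from datetime import timedelta

# Split eventos once, up front: one small dataframe per circuit.
eventos_dict = {circuito: grupo for circuito, grupo in eventos.groupby("CIRCUITO")}

def calcMovelMensalCircuitoParticionado(nue, data, circuito):
    d = data - timedelta(days=365)
    # The query now scans only this circuit's partition, not the whole table.
    sub = eventos_dict[circuito]
    return sub.query("DATA <= @data & DATA > @d & NUE_ORDEM == @nue")["DEC_EMPRESA"].sum()

dfCircuito['DEC_MOVEL'] = dfCircuito.apply(
    lambda x: calcMovelMensalCircuitoParticionado(x['NUE'], x['DATA'], x['DESCRICAO']),
    axis=1)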

Failing that, we would have to look at other strategies to index the eventos dataframe - the equivalent of creating an index in a database - but I don't know offhand how to do the equivalent of that in Pandas.
