A brief explanation
A very common error is to use lambda calling a function and in this function reference the whole dataframe, when the ideal would only be to reference the line
Example:
For the call df["nova coluna"] = df.apply(lambda x: minha_func(x), axis=1)
The mistake would be
def minha_func(row):
if df["coluna"] > df["outra coluna"]:
return "Maior"
else:
return "Menor ou igual"
In a 1000 Rows dataframe, the script would run the lambda
1000 times calling the function. The function in turn would run on the entire dataframe (X1000). That is, the script would have an iteration of 1000 x 1000... Instead of just 1000.
The right thing should be
def minha_func(row):
if row["coluna"] > row["outra coluna"]:
return "Maior"
else:
return "Menor ou igual"
We now come to the solutions
Case 1, using apply
The dataframe is
>>> df
Product Location Start_Date End_Date Forecast Comment Forecast RELEX
0 90646164 1 01/04/2021 30/04/2021 15 OPECOM 13,07
1 90646164 1 01/04/2021 30/04/2021 10 OPECOM 13,07
2 90646150 1 01/04/2021 30/04/2021 11 OPECOM 27,24
3 90646143 1 01/04/2021 30/04/2021 12 OPECOM 5,87
- Step 1: Transform column
Forecast RELEX
float
# Virgula para ponto
df["Forecast RELEX"] = df["Forecast RELEX"].str.replace(",", ".")
# Coluna para float
df["Forecast RELEX"] = pd.to_numeric(df["Forecast RELEX"], errors='coerce')
- Step 2 (all in the lambda)
df["Status"] = df.apply(lambda row: "Subir" if row["Forecast"] >= row["Forecast RELEX"] else "Nao Subir", axis=1)
def get_status(row):
if row["Forecast"] >= row["Forecast RELEX"]:
return "Subir"
else:
return "Nao Subir"
df["Status"] = df.apply(lambda row: get_status(row), axis=1)
The result will be:
Product Location Start_Date End_Date Forecast Comment Forecast RELEX Status
0 90646164 1 01/04/2021 30/04/2021 15 OPECOM 13.07 Subir
1 90646164 1 01/04/2021 30/04/2021 10 OPECOM 13.07 Nao Subir
2 90646150 1 01/04/2021 30/04/2021 11 OPECOM 27.24 Nao Subir
3 90646143 1 01/04/2021 30/04/2021 12 OPECOM 5.87 Subir
Case 2
- Step 1: Transform column
Forecast RELEX
float
Do the same as described in Caso 1
above
- Step 2: Create a column
Status
with Nan
import numpy as np
df["Status"] = np.nan
- Step 3: Dealing with Climbing
df.loc[df["Forecast"] >= df["Forecast RELEX"], ["Status"]] = "Subir"
- Step 4: Deal with "not climbing"
Since Subir
has already been inserted...
df["Status"].fillna("Nao subir", inplace=True)
Note Although the Caso 2
has more steps, it is much faster to run than the first. That is, in really large dataframes, opt for the second case.
The result is the same
Product Location Start_Date End_Date Forecast Comment Forecast RELEX Status
0 90646164 1 01/04/2021 30/04/2021 15 OPECOM 13.07 Subir
1 90646164 1 01/04/2021 30/04/2021 10 OPECOM 13.07 Nao subir
2 90646150 1 01/04/2021 30/04/2021 11 OPECOM 27.24 Nao subir
3 90646143 1 01/04/2021 30/04/2021 12 OPECOM 5.87 Subir
The column Forecast Relex contains numbers with comma
(13,07)
, to be valid your if would have to be point(13.07)
instead of commas– Zé Reis M. Olliver
You’re probably comparing two strings,
'12'
and'5,87'
, then in this comparison the character'5'
is larger than the'1'
, returning the second value as higher.– Woss
If I turn , to . wrong, with the following error: '>' not supported between instances of 'str' and 'float'
– Vivian