How to filter rows where columns meet consecutive conditions in Python?

3

I’m trying to flag rows in which the columns meet two conditions consecutively. That is, if a column holds L or I and the next column holds A or S, return 1 in a new column (otherwise, return 0).

Input:

       RFA RFB RFC RFD       
    0   S   S   S   S   
    1   A   I   A   A       
    2   A   A   L   A       
    4   S   S   L   A       

Output:

       RFA RFB RFC RFD  promo
    0   S   S   S   S     0
    1   A   I   A   A     1 
    2   A   A   L   A     1
    4   S   S   L   A     1

Script:

      def promo_behaviour(x):
          for i in range(0,95411):
             for j in data_rfa_r.columns:
                 if (x[j][i] == 'L' or x[j][i] == 'I') and (x[j][i+1] == 'A' or x[j][i+1] == 'S'):
                    return 1
                 else:
                    return 0
      data_rfa_r['promo'] = data_rfa_r.apply(promo_behaviour)

I wrote this function, but without success (95411 is the number of observations/rows).

I forgot to mention that, in the context of the problem, the column at index 0 is the most recent one! I mean, the columns should be read from right to left.

EDIT:

Output:
       RFA promo2 RFB promo1 RFC RFD    
    0   S    0    S     0     S   S   
    1   A    1    I     0     A   A     
    2   A    0    A     1     L   A   
    4   S    0    S     0     L   A   

   

  • Good afternoon! Not in the actual database! But there are more than 25 variables (i.e. 25 columns)...

2 answers

2

You can use isin by creating a list of possible combinations.

vl = ['LA','IA','LS','IS']
dados['promo'] = (dados.shift(axis = 1) + dados).isin(vl).any(axis = 1).astype(int)
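  1. dados.shift(axis = 1) shifts the data frame one column to the right, so adding it to the original concatenates each value with the value to its left
  2. isin checks whether each concatenated pair occurs in the list
  3. any(axis = 1) checks whether any value in the row is True
  4. astype(int) returns 0 or 1 instead of True or False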
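
The same adjacent-column idea can also be sketched with boolean masks instead of string concatenation. This avoids adding strings to the NaN column that shift introduces. The helper name flag_consecutive and its right_to_left parameter are hypothetical, added only to illustrate the right-to-left reading mentioned in the question; the sample frame is copied from the question.

```python
import pandas as pd

# Sample frame copied from the question (note that the index skips 3).
dados = pd.DataFrame(
    {
        "RFA": list("SAAS"),
        "RFB": list("SIAS"),
        "RFC": list("SALL"),
        "RFD": list("SAAA"),
    },
    index=[0, 1, 2, 4],
)


def flag_consecutive(frame, first=("L", "I"), then=("A", "S"), right_to_left=False):
    """Return 1 for rows where a `first` value is directly followed by a `then` value.

    Hypothetical helper, not part of pandas.
    """
    # Reverse the column order when the rightmost column is the oldest one.
    values = frame.iloc[:, ::-1] if right_to_left else frame
    # Mask over every column except the last: does the pair start with L/I?
    starts = values.iloc[:, :-1].isin(list(first)).to_numpy()
    # Mask over every column except the first: does the pair end with A/S?
    follows = values.iloc[:, 1:].isin(list(then)).to_numpy()
    # A row is flagged when both masks are True for the same adjacent pair.
    hits = (starts & follows).any(axis=1)
    return pd.Series(hits.astype(int), index=frame.index)


promo = flag_consecutive(dados, right_to_left=True)
print(dados.assign(promo=promo))
```

On this sample both reading directions flag the same rows, but they can differ on other data, so the direction is worth setting explicitly. Missing values never match either mask, so they do not need to be replaced first.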

1


One-line solution:

df['promo']=pd.Series([bool(re.search(r'(L|I)(?=[AS])',k)) for k in df.sum(axis=1)])

My idea was to collapse the columns into a single column holding the concatenation of the others, and then apply the logical test to that new column with a regex. I used a positive lookahead to check whether an A or S comes right after an I or L. Whole code:

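import re

import pandas as pd

# Read the comma-separated sample data
df = pd.read_csv("stack.txt", sep=",")

# Concatenate each row into one string (e.g. "AIAA"), search it for an L or I
# that is immediately followed by an A or S, and map the result to 1/0.
# index=df.index keeps the flags aligned with the rows if the index has gaps.
df['promo'] = pd.Series(
    [bool(re.search(r'(L|I)(?=[AS])', k)) for k in df.sum(axis=1)],
    index=df.index,
).map({True: 1, False: 0})

print(df)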

Returns:

  RFA RFB RFC RFD  promo
0   S   S   S   S      0
1   A   I   A   A      1
2   A   A   L   A      1
3   S   S   L   A      1
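
If the frame contains missing values, the row-wise concatenation needs them replaced first. Below is a minimal sketch of one way to do that. It assumes "-" is not a valid code and keeps this answer's left-to-right reading. The PLACEHOLDER name and the fifth row are hypothetical, added only to show a row with gaps.

```python
import numpy as np
import pandas as pd

# Sample data from the question, plus one hypothetical row with missing values.
df = pd.DataFrame(
    {
        "RFA": ["S", "A", "A", "S", "A"],
        "RFB": ["S", "I", "A", "S", "L"],
        "RFC": ["S", "A", "L", "L", np.nan],
        "RFD": ["S", "A", "A", "A", np.nan],
    }
)

PLACEHOLDER = "-"  # assumption: this character never appears as a real code

# Build one string per row, e.g. "AIAA"; gaps become the placeholder.
joined = df.fillna(PLACEHOLDER).astype(str).apply("".join, axis=1)

# Flag rows where an L or I is immediately followed by an A or S.
df["promo"] = joined.str.contains(r"[LI][AS]").astype(int)

print(df)
```

Because the placeholder occupies the gap, an L/I and an A/S separated by a missing value are not treated as consecutive. Whether that is the right behaviour depends on what the gaps mean in the real data.
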
  • Although I don’t think it will work in my particular case because of missing values (null), I think it would otherwise be completely correct!

  • Well, I guess you could just replace NULL with any string other than A, S, L or I, no?

  • In another situation that would do, but in this case the null values matter, because they are not really null values (they are simply moments when the customer was not yet a partner and so had no access to the promotion). But replacing them should work for sure! I will try it and, if it works, I will mark this as correct too! Thank you very much!

  • Ok. Avoid nested for loops, because that is an O(n^2) algorithm and therefore extremely inefficient.

  • I’ll take your advice! The sample is only 100k rows, but still, it always pays off!

  • After replacing the nulls with another letter, it worked, and in much better time! By the way, do you happen to know whether a promo column can be placed where the condition is met? I mean, next to the promotion that went from L to A!

  • Perhaps that calls for opening another question.

  • Okay! I was just trying to avoid opening another topic!
