How to loop a dataframe by conditionally removing lines and restarting the loop in a recalculated dataframe at each removal?

Asked

Viewed 2,511 times

0

Hello,

I am performing some data processing and I need my algorithm to perform the comparison between two lines and remove the worst one, according to some conditions.

Each time the algorithm performs the removal, it must recalculate the dataframe, as one of the conditions depends on the relation of the bottom line to the top.

My dataframe has the following format

df = n° de fios 1     diametro 1     n° de fios 2     diametro 2     seção total   relação
           n11             d11             n21             d21           st1       st2/st1
           n12             d12             n22             d22           st2       st3/st2

This goes on for several lines.

What my algorithm does is:

  • Reads the dataframe excel, which has the first 5 columns
  • Calculate the last column
  • Add another column with a rank based on the first column (coming from a Dict)
  • Run the lines through a loop for n in range(0, len(df)-1)
  • If the ratio is less than 1.03, you should remove one of the two lines
  • There is a set of conditions that must be met for the line n+1 is removed, otherwise removes the line n (the conditions include the rank).
  • When a line is removed, reset the index, removes the column rank, adds the dataframe à um excel and applies the loop in the new excel.

My code is this::

class Rank_and_drop(object):
    def __init__(self): # define os dicts
    #exemplo
    self.dict_mc_112_2p_cu = {'0.71': 0, '0.75': 1, '0.67': 2, '0.8': 3, '0.85': 4, '0.63': 5, '0.45': 6, '0.475': 7, '0.5': 8, '0.53': 9, '0.56': 10, '0.6': 11}
def recalc_df(self, file, sheet):
    # lê a tabela
    df_final = pd.read_excel(file, sheet_name=sheet)
    # converte os numeros para strings para poder usar o self.dict
    df_final = df_final.astype(str)

    df_final = df_final.drop(columns=['rsc'])
    df_final = df_final.sort_values(by='sc')
    df_final = df_final.reset_index(drop=True)

    aux = df_final.loc[:, 'sc']
    aux = aux.reset_index(drop=True)

    aux2 = aux.drop(0)
    aux2 = aux2.reset_index(drop=True)
    aux2 = aux2.append(aux2.tail(1), ignore_index=True)

    for x in range(len(aux)):
        aux[x] = (round(float(aux2[x])/float(aux[x]), 3))

    aux = aux.rename('rsc')

    df_final['rsc'] = aux

    df_final = df_final.astype(str)

    df_final['rnk'] = df_final['d1'].map(dict_)

    return df_final

def loop(self, file, sheet):
    df = self.recalc_df(file, sheet)

    for n in range(0,len(df)-1):
        c_0 = float(df.at[n, 'rsc']) <= 1.03

        c_1 = (df.at[n, 'n2'] == '0') and (df.at[n+1, 'n2'] == '0')
        c_1_1 = df.at[n+1, 'rnk'] >= df.loc[n, 'rnk']
        c_1_2 = float(df.at[n, 'n1']) < float(df.at[n+1, 'n1'])
        c_2 = (df.at[n, 'n2'] == '0') and (df.at[n+1, 'n2'] != '0')
        c_4 = (df.at[n, 'n2'] != '0') and (df.at[n+1, 'n2'] != '0')
        c_4_1 = df.at[n+1, 'rnk'] == df.at[n, 'rnk']
        c_4_1_1 = float(df.at[n+1, 'd2']) < float(df.at[n, 'd2'])
        c_4_2 = df.at[n+1, 'rnk'] > df.at[n, 'rnk']

        if c_0:
            if (((c_1) and ((c_1_1) or (c_1_2))) or (c_2) or ((c_4) and ((((c_4_1) and (c_4_1_1)) or (c_4_2))))):
                df = df.drop(n+1)

            else:
                df = df.drop(n)


            df = df.reset_index(drop=True)
            df = df.drop(columns=['rnk'])
            df.to_excel('{0}.xlsx'.format(sheet), index=False, sheet_name=sheet)
            self.loop('{0}.xlsx'.format(sheet), sheet)

        else:
            pass

        n = n+1

    df.to_excel('{0}.xlsx'.format(sheet), index=False, sheet_name=sheet)
    return df 

Now, the problem is that mine loop doesn’t work the way I expect... It starts correctly, and to some extent even does what is expected, but from a certain point, the algorithm seems to give a bug, and removes several lines that it should not, and at the end returns a KeyError.

Any idea what I can do to accomplish this task, more optimally and without these errors?

Edit: I have several dataframes that I want to do this, of varying sizes, in the smallest of them (35 lines), the code correctly performs the function, however, returns the KeyError. In another dataframe, 690 lines, the problem of removing multiple lines occurs.

  • Hi @Lucas, I suggest you "disassemble" a little this loop in a slightly smaller function, to make it easier to debug, usually when working with pandas I avoid doing loops, and when there is no way, I give preference to native iterators of the class iterrows or tuple

  • I didn’t understand your idea, the loop is simple, but it has several conditions, I tried to use a while and solved my problem in parts... But the keyerror still occurs, I don’t know what else to do...

  • I think I understood your question correctly and published an answer. However, next time try to provide an example minimum, complete and verifiable, that facilitates the understanding of users who intend to help, or who have the same doubt as you :)

1 answer

1


Reproducing the defect

Note the following code

import pandas as pd

df = pd.DataFrame(
    [   
        [6, 5, 7, 3],
        [1, 7, 8, 9],
        [7, 8, 3, 10],
    ],
    columns=list('ABCD'))

# Condição de remoção: uma linha deve ser removida se seu valor na coluna "A" é maior que 5
condição = lambda dados, n: dados.loc[n, "A"] > 5

for n in range(0,len(df)-1):
    print("*"*30)
    print("Antes:\n%s"%df)
    if condição(df, n):
        print("Condição verdadeira na linha %s:\n%s\n" % (n, df.loc[[n]]))
        df = df.drop(n+1)
    else:
        print("Condição falsa na linha %s:\n%s\n" % (n, df.loc[[n]]))
        df = df.drop(n)
    print("Depois:\n%s\n\n"%df)

Run this code, generates the following output:

Antes:
   A  B  C   D
0  6  5  7   3
1  1  7  8   9
2  7  8  3  10
Condição verdadeira na linha 0:
   A  B  C  D
0  6  5  7  3

Depois:
   A  B  C   D
0  6  5  7   3
2  7  8  3  10


******************************
Antes:
   A  B  C   D
0  6  5  7   3
2  7  8  3  10
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/pandas/core/indexing.py", line 1790, in _validate_key
    error()
  File "/usr/lib/python3.6/site-packages/pandas/core/indexing.py", line 1785, in error
    axis=self.obj._get_axis_name(axis)))
KeyError: 'the label [1] is not in the [index]'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "outro.py", line 18, in <module>
    if condição(df, n):
  File "outro.py", line 13, in <lambda>
    condição = lambda dados, n: dados.loc[n, "A"] > 5
  File "/usr/lib/python3.6/site-packages/pandas/core/indexing.py", line 1472, in __getitem__
    return self._getitem_tuple(key)
  File "/usr/lib/python3.6/site-packages/pandas/core/indexing.py", line 870, in _getitem_tuple
    return self._getitem_lowerdim(tup)
  File "/usr/lib/python3.6/site-packages/pandas/core/indexing.py", line 998, in _getitem_lowerdim
    section = self._getitem_axis(key, axis=i)
  File "/usr/lib/python3.6/site-packages/pandas/core/indexing.py", line 1911, in _getitem_axis
    self._validate_key(key, axis)
  File "/usr/lib/python3.6/site-packages/pandas/core/indexing.py", line 1798, in _validate_key
    error()
  File "/usr/lib/python3.6/site-packages/pandas/core/indexing.py", line 1785, in error
    axis=self.obj._get_axis_name(axis)))
KeyError: 'the label [1] is not in the [index]'

As can be observed, the first execution of the loop erases line 1. In the second step, one tries to access the same line 1 that has been removed, and then the KeyError.

You might have drawn up the algorithm expecting the dataframe to behave similar to lists. So, you could add a

df = df.reset_index(drop=True)

soon after the removal of the line to reorder the indexes, but this still would not solve the problem. Note that, your loop must access n-1 positions of the dataframe, and therefore if during loop execution, the code removes 2 dataframe lines, when arriving at the time of accessing the position n-1 in the dataframe of n-2 lines, another KeyError would occur.

Solutions

  • Reconsidering line removal conditions - First of all, one must evaluate whether the objective of the program is correct. In many data sets, the lines are independent of each other, and therefore it is unusual to be required to remove a line from the evaluation of other lines.
  • Modifying the program to work properly - As we have seen, modify a Dataframe (or lists) while trying to iterate through your indexes is problematic. An alternative would be to build the list of indexes to be removed, then remove them with a df.drop only.

In the above example, simply replace the for loop with:

índicesParaSeremRemovidos = [n for n in range(len(df)) if condição(df,n)] + [n+1 for n in range(len(df)) if not condição(df,n) and n<3]
df.drop(índicesParaSeremRemovidos, inplace=True)
print(df)

resulting in

   A  B  C  D
1  1  7  8  9

Browser other questions tagged

You are not signed in. Login or sign up in order to post.