Removing lines from a dataframe that meet a certain condition

Question

Asked 6 years, 2 months ago

Viewed 16,188 times

1

Hello, I am using pandas on Python 3.x to manipulate a dataframe of data to be analyzed, and I need to remove the rows that meet certain conditions.

The dataframe has the following format:

           coluna 1     coluna 2     coluna 3     coluna 4
df_final=     x1           a1           y1           b1
              x2           a2           y2           b2
              x3           a3           y3           b3
              x4           a4           y4           b4

What I need to do is eliminate the rows where:

a is less than a predetermined value (e.g. a < 5)
b is less than the same predetermined value used for a (e.g. b < 5)
x + y is greater than a predetermined value (e.g. x + y > 7)

I tried to use pandas' .drop, but I couldn't get exactly what I wanted:

    df_final.drop(df_final[(df_final['Coluna 2'] < minimo) &
                              (df_final['Coluna 4'] < minimo) &
                              ((df_final['Coluna 1'] + df_final['Coluna 3']) > valor)])

Edit:

My data is like this:

    In [25]: df
    Out[25]: 
    Nº fio 1  Diâmetro fio 1  Nº fio 2  Diâmetro fio 2  Seção total
0          1            0.60         0            0.00        0.283
1          1            0.63         0            0.00        0.312
2          1            0.67         0            0.00        0.353
3          1            0.71         0            0.00        0.396
4          1            0.75         0            0.00        0.442
5          1            0.80         0            0.00        0.503
6          1            0.85         0            0.00        0.567
7          2            0.60         0            0.00        0.565
8          2            0.63         0            0.00        0.623
9          2            0.67         0            0.00        0.705
10         2            0.71         0            0.00        0.792
11         2            0.75         0            0.00        0.884
12         2            0.80         0            0.00        1.005
13         2            0.85         0            0.00        1.135
14         3            0.71         0            0.00        1.188
15         3            0.75         0            0.00        1.325
16         3            0.80         0            0.00        1.508
17         3            0.85         0            0.00        1.702
18         1            0.67         1            0.60        0.635
19         1            0.67         2            0.60        0.918
20         2            0.67         1            0.60        1.271
21         1            0.71         1            0.63        0.708
22         1            0.71         2            0.63        1.019
23         2            0.71         1            0.63        1.415
24         1            0.75         1            0.67        0.794
25         1            0.75         2            0.67        1.147
26         2            0.75         1            0.67        1.589
27         1            0.80         1            0.71        0.899
28         1            0.80         2            0.71        1.294
29         2            0.80         1            0.71        1.797
30         1            0.85         1            0.75        1.009
31         1            0.85         2            0.75        1.451
32         2            0.85         1            0.75        2.018
33         1            0.85         1            0.80        1.070
34         1            0.85         2            0.80        1.573
35         2            0.85         1            0.80        2.140
36         1            0.63         1            0.60        0.594
37         1            0.63         2            0.60        0.877
38         2            0.63         1            0.60        0.906
39         1            0.67         1            0.63        0.664
40         1            0.67         2            0.63        0.976
41         2            0.67         1            0.63        1.017
42         1            0.71         1            0.67        0.748
43         1            0.71         2            0.67        1.101
44         2            0.71         1            0.67        1.144
45         1            0.75         1            0.71        0.838
46         1            0.75         2            0.71        1.234
47         2            0.75         1            0.71        1.279
48         1            0.80         1            0.75        0.944
49         1            0.80         2            0.75        1.386
50         2            0.80         1            0.75        1.447

The rows are combinations of wires. What must not occur is having more than 2 wires in total (columns 1 and 3 added together, not counting the index) when the diameter is less than 0.71.

I mean, the data I want to remove is:

       Nº fio 1  Diâmetro fio 1  Nº fio 2  Diâmetro fio 2  Seção total
19         1            0.67         2            0.60        0.918
20         2            0.67         1            0.60        1.271
37         1            0.63         2            0.60        0.877
38         2            0.63         1            0.60        0.906
40         1            0.67         2            0.63        0.976
41         2            0.67         1            0.63        1.017

3 answers

5

To contribute to the thread, I suggest a solution that uses a boolean mask to select the desired data. The performance tests follow:

Using loc and drop

%%timeit
# & binds more tightly than |, so the OR is parenthesized explicitly
df_remove = df_final.loc[((df_final['Diametrofio1'] < .71) 
                         | (df_final['Diametrofio2'] < .71))
                         & ((df_final['Nfio1'] + df_final['Nfio2']) > 2)]

ultimo_df = df_final.drop(df_remove.index)

4.53 ms ± 65 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Using a mask and loc

%%timeit
# & binds more tightly than |, so the OR is parenthesized explicitly
mask = ((df_final['Diametrofio1'] < .71) | (df_final['Diametrofio2'] < .71)) & ((df_final['Nfio1'] + df_final['Nfio2']) > 2)

ultimo_df = df_final.loc[~mask]
# or: df_final = df_final.loc[~mask]

3.63 ms ± 100 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Roughly 0.9 ms faster per loop
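For anyone who wants to try both variants outside the timing cells, here is a minimal self-contained sketch. It uses six rows copied from the question's data, with the column names spelled as in this answer. The thresholds 0.71 and 2 come from the question, and the grouping (cond1 | cond2) & cond3 is an assumption based on the asker's description.

```python
import pandas as pd

# Six rows copied from the question's data (original index labels kept).
df_final = pd.DataFrame(
    {
        "Nfio1": [1, 2, 1, 1, 2, 1],
        "Diametrofio1": [0.60, 0.85, 0.67, 0.67, 0.67, 0.85],
        "Nfio2": [0, 0, 1, 2, 1, 2],
        "Diametrofio2": [0.00, 0.00, 0.60, 0.60, 0.60, 0.75],
    },
    index=[0, 13, 18, 19, 20, 31],
)

# True for the rows that should be removed.
# Assumed grouping: (cond1 | cond2) & cond3.
mask = (
    (df_final["Diametrofio1"] < 0.71) | (df_final["Diametrofio2"] < 0.71)
) & ((df_final["Nfio1"] + df_final["Nfio2"]) > 2)

# Variant 1: keep the rows where the mask is False.
kept_by_mask = df_final.loc[~mask]

# Variant 2: select the unwanted rows, then drop them by index label.
kept_by_drop = df_final.drop(df_final.loc[mask].index)

print(kept_by_mask.index.tolist())       # [0, 13, 18, 31]
print(kept_by_mask.equals(kept_by_drop)) # True
```

Both variants give the same result here; the mask version simply skips building the intermediate dataframe of unwanted rows.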

It is a good alternative, but I think one step is missing: removing the items from the original df. As I understand your code, you get a df with the rows I want to remove, but they are not removed yet.

– Lucas

2019/05/21 at 18:56
@Lucas I have just noticed that my suggestion had an error: inside loc, the mask should be negated. The correct form is ~mask. The answer is fixed now.

– Terry

2019/05/21 at 19:56

by Lucas • 41 points · Answer 1 · 2019-05-21T16:42:03+00:00

I managed to solve it by combining Gabriel's idea with a different way of selecting the rows. First I create a dataframe with the data I don't want:

    df_remove = df_final.loc[((df_final['Diâmetro fio 1'] < minBitolaPref) 
                             | (df_final['Diâmetro fio 2'] < minBitolaPref))
                             & ((df_final['Nº fio 1'] + df_final['Nº fio 2']) > nFmin)]

Then I remove those rows from the original dataframe using the index labels of df_remove, which are preserved from the original dataframe.

    df_final = df_final.drop(df_remove.index)

Edit:

The first two conditions were meant as one or the other, not one and the other, so I changed (cond1 & cond2 & cond3) to (cond1 | cond2) & cond3
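The explicit grouping matters because in Python `&` binds more tightly than `|`, so `cond1 | cond2 & cond3` is evaluated as `cond1 | (cond2 & cond3)`. A small sketch with made-up boolean values (not taken from the data above) shows the difference:

```python
import pandas as pd

# Made-up boolean conditions, purely for illustration.
cond1 = pd.Series([True, True, False, False])
cond2 = pd.Series([False, False, True, False])
cond3 = pd.Series([True, False, True, True])

# Without parentheses this is parsed as cond1 | (cond2 & cond3).
unparenthesized = cond1 | cond2 & cond3

# With parentheses the OR is evaluated first, then the AND.
intended = (cond1 | cond2) & cond3

print(unparenthesized.tolist())  # [True, True, True, False]
print(intended.tolist())         # [True, False, True, False]
```

The second element differs: cond1 alone is enough to select it in the unparenthesized form, even though cond3 is False.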

by Gabriel Machado • 1 point · Answer 2 · 2019-05-21T15:02:33+00:00

Written that way, the code is not very readable. What you actually have to do is find the index labels of the rows that meet your condition and then pass them to drop. For that, I would use a list comprehension.

# "condition" is a placeholder for whatever test you want to apply to each row
indices = [x for x in df.index if condition(df.loc[x])]
df = df.drop(indices)
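A runnable version of that idea might look like the sketch below. The helper `should_remove` and its parameter names are hypothetical, the four rows are copied from the question's data, and the thresholds and the (cond1 | cond2) & cond3 grouping are taken from the question and the asker's own answer. Note that checking rows one by one in Python is generally slower than a boolean mask on large dataframes.

```python
import pandas as pd

# Four rows copied from the question's data (original index labels kept).
df = pd.DataFrame(
    {
        "Nº fio 1": [1, 1, 2, 1],
        "Diâmetro fio 1": [0.67, 0.67, 0.67, 0.85],
        "Nº fio 2": [1, 2, 1, 2],
        "Diâmetro fio 2": [0.60, 0.60, 0.60, 0.75],
    },
    index=[18, 19, 20, 31],
)


def should_remove(row, min_diameter=0.71, max_wires=2):
    """Hypothetical helper: True when the row should be dropped."""
    thin = (row["Diâmetro fio 1"] < min_diameter
            or row["Diâmetro fio 2"] < min_diameter)
    too_many = (row["Nº fio 1"] + row["Nº fio 2"]) > max_wires
    return bool(thin and too_many)


# Collect the index labels of the rows to drop, then drop them.
indices = [x for x in df.index if should_remove(df.loc[x])]
df = df.drop(indices)

print(indices)
print(df.index.tolist())
```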