How to filter, select and count data in a pandas DataFrame?

How can I count the rows that meet a condition spanning several columns whose names share a prefix?

My dataframe looks something like this:

import pandas as pd

df = pd.DataFrame([["1111", True, True, False, True, True],
                   ["2222", True, False, True, True, False],
                   ["3333", True, False, True, True, True]],
                  columns=["id", "coluna_qualquer", "x_a", "x_b", "x_c", "x_d"])

I want the number of rows in which at least three columns have the value True, considering only the columns whose names start with "x_" and ignoring the values of the other columns (such as "coluna_qualquer"). In this example, the rows with ids "1111" and "3333" satisfy this condition, so the result I want is 2.

How to do this using pandas?

1 answer

In short

A few separate steps are required for this. DataFrame methods generally return a new, modified dataframe, so you can chain the next operation directly onto the result. To select the rows where at least 3 of the "x_*" columns are True and count them, just do:

In [98]: (df.filter(like='x_').sum(axis=1) >= 3).sum()
Out[98]: 2

Step by step

The first step is to select a sub-dataframe with the desired columns. pandas has the filter method for this: only the columns whose names contain the text passed in the like argument are selected:

In [91]: filtered_df = df.filter(like='x_')

In [92]: filtered_df
Out[92]: 
     x_a    x_b   x_c    x_d
0   True  False  True   True
1  False   True  True  False
2  False   True  True   True

(If pandas didn't have this, the alternative would be to use plain Python to select the names of the desired columns:

...
data_columns = [col_name for col_name in df.columns if col_name.startswith("x_")]

And then the loc accessor of the dataframe, which accepts a list of column names; passing : in the row position selects all rows:

filtered_df = df.loc[:, data_columns]

)
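One caveat worth illustrating: like='x_' matches the text anywhere in the column name, while the question asks for names that start with "x_". The sketch below shows two stricter alternatives, the regex argument of filter and a boolean mask built with df.columns.str.startswith. The extra column "aux_x_y" is hypothetical and added only to make the difference visible.

```python
import pandas as pd

df = pd.DataFrame(
    [["1111", True, True, False, True, True],
     ["2222", True, False, True, True, False],
     ["3333", True, False, True, True, True]],
    columns=["id", "coluna_qualquer", "x_a", "x_b", "x_c", "x_d"],
)
# Hypothetical extra column: its name contains "x_" but does not start with it.
df["aux_x_y"] = True

by_like = df.filter(like="x_")       # substring match: also selects "aux_x_y"
by_regex = df.filter(regex=r"^x_")   # match anchored at the start of the name
by_mask = df.loc[:, df.columns.str.startswith("x_")]  # boolean column mask

print(list(by_like.columns))
print(list(by_regex.columns))
```

With the original dataframe from the question all three selections give the same columns, so the choice only matters when other column names may contain the prefix.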

At this point we have only the columns of interest, and we can count:

     x_a    x_b   x_c    x_d
0   True  False  True   True
1  False   True  True  False
2  False   True  True   True

Here we can take advantage of a Python feature: bool is a subclass of int, so the values False and True can take part in a sum as if they were 0 and 1 respectively. The sum method of the dataframe can therefore give the total for each row of the table (we just need to indicate that we want the sum across each row by passing axis=1; otherwise the result is the sum of the values in each column):

In [93]: count_df = filtered_df.sum(axis=1)

In [94]: count_df
Out[94]: 
0    3
1    2
2    3
dtype: int64
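The claim that True and False behave as 1 and 0 can be checked in isolation. This is only a small sketch; the variable names are illustrative. NumPy, which pandas uses underneath, treats booleans the same way when summing.

```python
import pandas as pd

# bool is a subclass of int, so True and False act as 1 and 0 in arithmetic.
print(issubclass(bool, int))

total = True + True + False   # plain Python arithmetic on booleans

row = pd.Series([True, False, True, True])
row_count = row.sum()         # number of True values in the Series

print(total, row_count)
```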

(If the value to look for were not True, or if we wanted something other than a count of occurrences, we would use .apply instead of .sum: it takes a generic function that receives each row of the dataframe with axis=1 (or each column with axis=0) and produces a result.)
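As a rough illustration of the .apply variant, the sketch below counts, per row, the cells equal to a value other than True. The dataframe answers, the value "yes" and the helper count_yes are all hypothetical and not part of the question.

```python
import pandas as pd

# Hypothetical data: the cells hold strings instead of booleans.
answers = pd.DataFrame(
    [["yes", "no", "yes", "yes"],
     ["no", "yes", "yes", "no"]],
    columns=["x_a", "x_b", "x_c", "x_d"],
)

def count_yes(row):
    # With axis=1, `row` is a Series holding one row of the dataframe.
    return (row == "yes").sum()

per_row = answers.apply(count_yes, axis=1)
print(per_row.tolist())
```

For a test this simple, (answers == "yes").sum(axis=1) gives the same counts without calling a Python function per row; .apply is mainly useful when the per-row logic cannot be expressed with vectorized operations.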

And finally, to find out how many of these rows have a count of at least 3, we apply the operator >= 3. pandas redefines all binary operators, whether arithmetic or comparison, to create a new object of the same shape (here, a Series) with the result of the operation in each cell, i.e.:

In [95]: count_df >= 3
Out[95]: 
0     True
1    False
2     True
dtype: bool

And then just repeat the sum, this time letting it add up the True values in the column:

In [96]: (count_df >= 3).sum()
Out[96]: 2
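The steps above can be put together in a small script. This is a sketch rather than the only way to do it: the helper name count_rows_with_min_true and its parameters are made up for illustration, and the last line additionally uses the same boolean mask to list which ids matched.

```python
import pandas as pd

def count_rows_with_min_true(df, text, minimum):
    """Count rows with at least `minimum` True values in columns containing `text`.

    Hypothetical helper wrapping the filter / sum / compare / sum chain.
    """
    mask = df.filter(like=text).sum(axis=1) >= minimum
    return int(mask.sum())

df = pd.DataFrame(
    [["1111", True, True, False, True, True],
     ["2222", True, False, True, True, False],
     ["3333", True, False, True, True, True]],
    columns=["id", "coluna_qualquer", "x_a", "x_b", "x_c", "x_d"],
)

n = count_rows_with_min_true(df, "x_", 3)

# The same boolean mask can select the matching rows instead of counting them.
matching_ids = df.loc[df.filter(like="x_").sum(axis=1) >= 3, "id"].tolist()

print(n, matching_ids)
```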
