In short
This takes a few separate steps. The DataFrame object is designed so that each method call returns a new, modified dataframe, which means you can chain the next operation directly onto the result. So, to select all rows where at least 3 of the "x_*" columns are True and count how many such rows there are, just do:
In [98]: (df.filter(like='x_').sum(axis=1) >= 3).sum()
Out[98]: 2
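To try the one-liner yourself, here is a minimal, self-contained sketch. The "x_*" values mirror the output shown in this answer; the extra "name" column is an assumption, added only to show that filter leaves it out.

```python
import pandas as pd

# Hypothetical data: the "x_*" columns copy the values shown in this answer;
# the "name" column is invented for illustration.
df = pd.DataFrame({
    "name": ["a", "b", "c"],
    "x_a": [True, False, False],
    "x_b": [False, True, True],
    "x_c": [True, True, True],
    "x_d": [True, False, True],
})

# Select the "x_*" columns, count the True values per row,
# then count the rows that have at least 3 of them.
n_rows = (df.filter(like="x_").sum(axis=1) >= 3).sum()
print(n_rows)
```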
Let’s take it step by step
The first step is to select a sub-dataframe with the desired columns. Pandas has the filter method for this: only the columns whose names contain the text passed in the like argument are selected:
In [91]: filtered_df = df.filter(like='x_')
In [92]: filtered_df
Out[92]:
     x_a    x_b   x_c    x_d
0   True  False  True   True
1  False   True  True  False
2  False   True  True   True
(If pandas did not have this method, you could use plain Python to select the names of the desired columns:
...
data_columns = [col_name for col_name in df.columns if col_name.startswith("x_")]
And then use the dataframe's loc indexer, which accepts the list of column names; passing : as the row selector means "select all rows":
filtered_df = df.loc[:, data_columns]
)
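One difference between the two approaches is worth a quick check: like matches the text anywhere in the column name, while startswith only matches a prefix. The sketch below uses made-up column names (max_x_b and other are assumptions) to show this, along with filter(regex=...) as a prefix-only variant.

```python
import pandas as pd

# Hypothetical columns: "max_x_b" contains "x_" but does not start with it.
df = pd.DataFrame({
    "x_a": [True, False],
    "max_x_b": [False, True],
    "other": [1, 2],
})

# like= is a substring match on the column name.
by_substring = list(df.filter(like="x_").columns)

# regex= with an anchor restricts the match to a prefix.
by_prefix = list(df.filter(regex="^x_").columns)

# The plain-Python version with startswith is also a prefix match.
by_comprehension = [c for c in df.columns if c.startswith("x_")]

print(by_substring)
print(by_prefix)
print(by_comprehension)
```

If all the columns of interest really do start with "x_" and no other column contains that text, the three selections are the same.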
Now you have only the columns of interest, and we can count:
     x_a    x_b   x_c    x_d
0   True  False  True   True
1  False   True  True  False
2  False   True  True   True
Here we can take advantage of a Python feature: False and True are instances of bool, which is a subclass of int, so they can take part in a sum as if they were 0 and 1 respectively. So the dataframe's own sum method can give the total for each row of the table (we just need to indicate that we want the sum across each row by passing axis=1; otherwise sum returns the sum of the values in each column):
In [93]: count_df = filtered_df.sum(axis=1)
In [94]: count_df
Out[94]:
0    3
1    2
2    3
dtype: int64
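The sketch below checks both claims on their own: that booleans add up like integers in plain Python, and that axis decides whether sum runs per row or per column. The name flags is an assumption; its values copy the filtered table above.

```python
import pandas as pd

# bool is a subclass of int, so True counts as 1 and False as 0.
assert issubclass(bool, int)
assert True + True + False == 2

# Same values as the filtered dataframe shown above.
flags = pd.DataFrame({
    "x_a": [True, False, False],
    "x_b": [False, True, True],
    "x_c": [True, True, True],
    "x_d": [True, False, True],
})

per_row = flags.sum(axis=1)  # one total per row
per_col = flags.sum()        # default axis=0: one total per column

print(per_row.tolist())
print(per_col.to_dict())
```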
(If the value to look for were not True, or if you wanted something other than a simple count of occurrences, then instead of .sum we would use .apply, which lets you pass a generic function that receives each dataframe row (or each column, if axis=0) and produces a result.)
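As a rough illustration of the apply route, suppose the cells held strings instead of booleans. Everything here (the answers dataframe, the "yes" value, the count_yes helper) is hypothetical and only meant to show the shape of the call.

```python
import pandas as pd

# Hypothetical data: strings instead of True/False.
answers = pd.DataFrame({
    "x_a": ["yes", "no", "no"],
    "x_b": ["no", "yes", "yes"],
    "x_c": ["yes", "yes", "maybe"],
})

def count_yes(row):
    # row is a Series holding one row of the dataframe
    return (row == "yes").sum()

# axis=1 passes each row to the function; axis=0 would pass each column.
yes_per_row = answers.apply(count_yes, axis=1)

# For a simple equality test like this one, a vectorized comparison
# followed by sum gives the same result without apply.
yes_per_row_vectorized = (answers == "yes").sum(axis=1)

print(yes_per_row.tolist())
```

apply is the more general tool, but it calls the Python function once per row, so the vectorized form is usually preferable when one exists.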
And finally, to find out how many of these rows have a value of 3 or more, we apply the operator >= 3. Pandas redefines all binary operators, whether arithmetic or comparison, so that they return a new object of the same shape (here, a Series) holding the result of the operation for each element, i.e.:
In [95]: count_df >= 3
Out[95]:
0     True
1    False
2     True
dtype: bool
And then just repeat sum, this time letting it add up the True values in the column:
In [96]: (count_df >= 3).sum()
Out[96]: 2
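If you want the matching rows themselves rather than just how many there are, the same boolean Series can be used as a mask. This is a small sketch with the same data as above, plus a hypothetical "name" column so the selected rows are easy to recognize.

```python
import pandas as pd

# "x_*" values as in this answer; "name" is invented for illustration.
df = pd.DataFrame({
    "name": ["a", "b", "c"],
    "x_a": [True, False, False],
    "x_b": [False, True, True],
    "x_c": [True, True, True],
    "x_d": [True, False, True],
})

# Boolean Series: True for rows with at least 3 True values in the "x_*" columns.
mask = df.filter(like="x_").sum(axis=1) >= 3

# Indexing the dataframe with the mask keeps only the rows marked True.
matching_rows = df[mask]

print(matching_rows["name"].tolist())
print(mask.sum())
```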