Remove less frequent rows from a pandas DataFrame


I have a DataFrame with more than 13,000 rows and would like to remove some of them based on how often their value appears in the column named variedade.

df.variedade.value_counts()

RB867515    5084
SP813250    2500
RB855453     981
others       849
RB855156     750
RB855536     633
SP832847     561
RB835054     541
SP801842     423
SP835073     326
RB835486     253
RB845210     199
SP803280     187
RB72454      164
RB966928     146
Name: variedade, dtype: int64

I would like to keep only the 3 most frequent varieties and delete the rest, which would reduce the DataFrame to just over 8,000 rows.

I tried the command:

v = df[['variedade']]

df[v.replace(v.apply(pd.Series.value_counts)).gt(900).all(1)]

However, when I call value_counts on the variedade column afterwards, it shows that I still have more than 13,000 rows. Does anyone have any idea where I’m going wrong?

1 answer


Combine value_counts with head(3).index to get the three most frequent values in the column (stored in mask below). Then use isin to select only the rows whose variedade is one of them.

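# Labels of the 3 most frequent values in 'variedade'
mask = df['variedade'].value_counts().head(3).index
# Keep only the matching rows; assign the result back, because .loc returns a new DataFrame
df = df.loc[df['variedade'].isin(mask)]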
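
For anyone who wants to try this without the original data, here is a minimal, self-contained sketch of the same idea. The sample DataFrame, the valor column, and the names top3 and keep are made up for illustration; only value_counts, head, isin and .loc come from the answer above.

```python
import pandas as pd

# Hypothetical sample data standing in for the real DataFrame
df = pd.DataFrame({
    "variedade": ["A"] * 5 + ["B"] * 4 + ["C"] * 3 + ["D"] * 2 + ["E"],
    "valor": range(15),
})

# Labels of the 3 most frequent values in 'variedade'
# (value_counts sorts by count, highest first)
top3 = df["variedade"].value_counts().head(3).index

# Boolean mask: True for rows whose variety is one of the top 3
keep = df["variedade"].isin(top3)

# Reassign so that df now holds only the filtered rows
df = df.loc[keep]

print(df.shape)  # (12, 2)
print(df["variedade"].value_counts())
```

With this sample data, only the rows for A, B and C remain, so the row count drops from 15 to 12.
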
  • Terry, I tried this command, but df.shape still shows more than 13,000 rows. I want to drop the rows that do not belong to the 3 most frequent varieties, but that did not happen.

  • To reduce the number of rows in your DataFrame, you need to assign the result of .loc to a variable. See the edit to my answer :)

  • Thanks for the tip! It helped a lot!!!

  • @Rafaelmansini If you accept my answer as the correct one, it also helps me :)
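
The point raised in the comments, that the filter has no effect unless its result is assigned, can be illustrated with a small sketch. The data and the variable names here are hypothetical and only show the behaviour: selecting with .loc and a boolean mask returns a new DataFrame and leaves the original untouched.

```python
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({"variedade": ["A", "A", "A", "B", "B", "C"]})
top1 = df["variedade"].value_counts().head(1).index

# Without assignment: the filtered result is discarded and df is unchanged
df.loc[df["variedade"].isin(top1)]
rows_without_assignment = len(df)

# With assignment: df now refers to the filtered DataFrame
df = df.loc[df["variedade"].isin(top1)]
rows_with_assignment = len(df)

print(rows_without_assignment, rows_with_assignment)  # 6 3
```
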
