I have a DataFrame with more than 13,000 rows and would like to remove some of them based on how often their value appears in the column named variedade.
df.variedade.value_counts()
RB867515 5084
SP813250 2500
RB855453 981
others 849
RB855156 750
RB855536 633
SP832847 561
RB835054 541
SP801842 423
SP835073 326
RB835486 253
RB845210 199
SP803280 187
RB72454 164
RB966928 146
Name: variedade, dtype: int64
I would like to keep only the 3 most frequent varieties and delete the rest, which would reduce the DataFrame to just over 8,000 rows.
I tried the command:
v = df[['variedade']]
df[v.replace(v.apply(pd.Series.value_counts)).gt(900).all(1)]
However, when I run value_counts on the variedade column again afterwards, it still shows that I have more than 13,000 rows.
Does anyone have any idea where I’m going wrong?
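One possible cause, assuming the filtered result was never assigned to a variable: boolean indexing returns a new DataFrame and leaves df unchanged, so a later value_counts on df still sees every row. The sketch below uses made-up toy data (the valor column and the name df_top are hypothetical) and selects the top 3 varieties by name with value_counts, nlargest and isin, rather than by a count threshold such as 900.

```python
import pandas as pd

# Toy data standing in for the real DataFrame (values are invented for illustration)
df = pd.DataFrame({
    "variedade": ["RB867515"] * 5 + ["SP813250"] * 4 + ["RB855453"] * 3
                 + ["others"] * 2 + ["RB855156"],
    "valor": range(15),  # hypothetical extra column
})

# Names of the 3 most frequent varieties
top3 = df["variedade"].value_counts().nlargest(3).index

# Filtering returns a new DataFrame; keep the result, otherwise df is unchanged
df_top = df[df["variedade"].isin(top3)]

print(df_top["variedade"].value_counts())
```

Assigning back to the same name (df = df[...]) would also work if the original rows are no longer needed.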
Rafael, if my answer solved your problem, you can mark it as accepted. See how and why to accept an answer.
– Terry