How to get only the non-duplicated records with pandas

0

How can I get only the rows of a dataframe that are not duplicated? I don't mean one copy of each record, so df.unique() would not fit here. I want only the rows that occur exactly once. I tried it this way, but I don't know if it's right.

df2 = DF
df2.drop_duplicates('userId', keep=False, inplace=True)

Then I would use df2, which should contain only the rows that are not duplicated. Is this approach correct?

1 answer

3


Almost.

df2 = DF does not create a copy of DF; it just gives the same object a second name.

When you call drop_duplicates(..., inplace=True), the modifications happen directly in the dataframe (i.e. your dataframe loses the duplicated rows). The way you did it, the duplicates would be removed from DF as well as from df2, because they are in fact the same object.
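The aliasing described above can be checked with a small sketch. The sample data and the score column are made up for illustration; only the userId column name comes from the question.

```python
import pandas as pd

# Hypothetical sample data: userId 2 appears twice.
DF = pd.DataFrame({"userId": [1, 2, 2, 3], "score": [10, 20, 30, 40]})

df2 = DF                   # a second name for the same object, not a copy
same_object = df2 is DF    # True

# Dropping in place through df2 also changes DF,
# because both names refer to one dataframe.
df2.drop_duplicates("userId", keep=False, inplace=True)
ids_left_in_DF = DF["userId"].tolist()   # [1, 3]
```

If an independent copy is really wanted before modifying in place, DF.copy() would give one.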

The correct way would simply be:

df2 = DF.drop_duplicates('userId', keep=False)

This creates a new dataframe from DF without the rows whose userId is duplicated and assigns it to df2. DF itself is left unchanged.
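As a minimal sketch of that behavior, again with made-up sample data (only userId comes from the question), keep=False drops every row whose userId occurs more than once, and the original dataframe keeps all its rows. A boolean mask built with duplicated() is shown as an equivalent alternative.

```python
import pandas as pd

# Hypothetical sample data: userId 2 appears twice.
DF = pd.DataFrame({"userId": [1, 2, 2, 3], "score": [10, 20, 30, 40]})

# keep=False removes all rows of a duplicated userId, not just the extras.
df2 = DF.drop_duplicates("userId", keep=False)
kept_ids = df2["userId"].tolist()   # [1, 3]
rows_in_DF = len(DF)                # 4, DF is untouched

# Equivalent selection with a boolean mask.
mask = ~DF.duplicated("userId", keep=False)
df3 = DF[mask]
```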

  • So the right thing would be to leave out the (..., inplace=True)?

  • Exactly, and preferably without the df2 = DF at the beginning as well, since it becomes unnecessary. inplace=True makes the operation happen directly in the dataframe on which it is called, while without it the operation is done on a copy, which the function then returns.
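The difference in return values mentioned in the comments can be sketched as follows (sample data is hypothetical). One consequence worth noting: with inplace=True the method returns None, so assigning its result to a variable would not give a dataframe.

```python
import pandas as pd

# Hypothetical sample data: userId 2 appears twice.
DF = pd.DataFrame({"userId": [1, 2, 2, 3]})

# Without inplace: a new dataframe is returned, DF stays as it was.
returned_copy = DF.drop_duplicates("userId", keep=False)
rows_before = len(DF)        # 4

# With inplace=True: DF itself is modified and nothing is returned.
returned_inplace = DF.drop_duplicates("userId", keep=False, inplace=True)
rows_after = len(DF)         # 2
```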
