Python/Pandas - How to create a data frame that contains the original line and duplicate line

I have a data frame that contains two rows with País = Índia. I was able to create a duplicate-free data frame with only one Índia row, and a data frame containing only the duplicated row. What I need is a data frame that contains only the two Índia rows. How can I do that?

import pandas as pd
import numpy as np
data = {
'País': ['Bélgica', 'Índia', 'Brasil','Índia'],
'Capital': ['Bruxelas', 'Nova Delhi', 'Brasília', 'Nova Delhi'],
'População': [123465, 456789, 987654, 456789]
}
df = pd.DataFrame(data)
# build a DF that excludes the duplicated rows
drop_df = df.drop_duplicates()
# build a data frame with only the duplicates
dfdrop = df[df.duplicated()]

How can I generate a DF with only the two Índia rows?

2 answers

2

(TL;DR)
Building the dataframe from the data:

import pandas as pd
import numpy as np
from collections import OrderedDict
data = OrderedDict(
{
'País': ['Bélgica', 'Índia', 'Brasil','Índia'],
'Capital': ['Bruxelas', 'Nova Delhi', 'Brasília', 'Nova Delhi'],
'População': [123465, 456789, 987654, 456789]
})

df = pd.DataFrame(data)

Displaying the original dataframe:

df

Output:

Original dataframe

Removing the duplicates:

df_clean = df.drop_duplicates()
df_clean

Output:

Dataframe without duplicates

Selecting the duplicates:

paises = df.País
df_duplicates = df[paises.isin(paises[paises.duplicated()])]
df_duplicates

Output:

Dataframe with only the duplicated rows

See the code running in a Jupyter notebook.
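To make the one-line selection above easier to follow, here is a minimal sketch that splits it into named steps on the question's data. The names repeated_values and mask are only illustrative and do not appear in the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    'País': ['Bélgica', 'Índia', 'Brasil', 'Índia'],
    'Capital': ['Bruxelas', 'Nova Delhi', 'Brasília', 'Nova Delhi'],
    'População': [123465, 456789, 987654, 456789],
})

paises = df['País']

# Step 1: duplicated() flags the second and later occurrences,
# so this keeps the country values that appear more than once.
repeated_values = paises[paises.duplicated()]

# Step 2: isin() is True for every row whose country is one of those
# values, including the first occurrence.
mask = paises.isin(repeated_values)

# Step 3: boolean indexing keeps the original row and its duplicate.
df_duplicates = df[mask]
print(df_duplicates)
```

The original index labels are preserved, so the two Índia rows keep labels 1 and 3; call reset_index(drop=True) if a fresh index is preferred.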

0

Looking into the duplicated method, I found that it has a keep parameter:
keep='first' (default) marks every occurrence except the first as a duplicate
keep='last' marks every occurrence except the last as a duplicate
keep=False marks all occurrences as duplicates
So, to create the dataframe with both duplicated rows, just run: dfdrop = df[df.duplicated('País', keep=False)]
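As a quick check of how the keep parameter of duplicated behaves, the sketch below compares its three settings on the question's data, assuming the column is named País as in the question. The variable names are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    'País': ['Bélgica', 'Índia', 'Brasil', 'Índia'],
    'Capital': ['Bruxelas', 'Nova Delhi', 'Brasília', 'Nova Delhi'],
    'População': [123465, 456789, 987654, 456789],
})

# Default keep='first': the first occurrence is not flagged.
mark_first = df.duplicated(subset='País')

# keep='last': the last occurrence is not flagged.
mark_last = df.duplicated(subset='País', keep='last')

# keep=False: every occurrence of a repeated value is flagged.
mark_all = df.duplicated(subset='País', keep=False)

print(pd.DataFrame({'first': mark_first, 'last': mark_last, 'all': mark_all}))

# Both Índia rows, original and duplicate.
dfdrop = df[mark_all]
print(dfdrop)
```

Only keep=False flags both rows, which is why it is the setting that answers the question.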
