How to change duplicate data in a dataframe?

Question

Asked 4 years, 6 months ago

I am trying to automate a process that I currently do by hand in Excel: extract the company's employee base from an Excel file, select a few specific columns (because the file is very large), remove certain hierarchy levels, and filter by company. So far this works, and any suggestions for improving it are very welcome. However, the name column contains some duplicate names that actually belong to different people, so the duplicates must be kept. In Excel I add a "." to the end of a name to tell them apart. Can I do the same in Python? I am using Google Colab.

Note: when I run the code, it shows two warnings:

WARNING *** file size (7827463) not 512 + Multiple of sector size (512)
/usr/local/lib/python3.6/dist-packages/ipykernel_launcher.py:6: UserWarning: Boolean Series key will be reindexed to match DataFrame index.

Is this normal?

import pandas as pd

view = pd.read_excel("/content/View.xls")
filtro = view['Emp'] < 4                 # companies to keep
filtro2 = view['Hierarquia Cargo'] > 3   # hierarchy levels to keep
view1 = view[filtro]
view5 = view1[filtro2]
bd = view5[['Nome', 'Emp', 'EST', 'Matr', 'Nome Estabelecimento', 'Descr Unid Lotacao', 'Descr CC', 'Desc Afast']]
bd = bd.sort_values(by='Nome', ascending=True)
display(bd)
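
On the second warning: it most likely comes from `view1[filtro2]`, because `filtro2` was built from `view` and still carries the index of the full table, while `view1` is already a filtered subset. A minimal sketch of one way to avoid it is to combine both conditions into a single mask and apply it once. The small DataFrame below is a hypothetical stand-in for `View.xls`; its values are invented, and only three of the original columns are used.

```python
import pandas as pd

# Hypothetical stand-in for the data read from View.xls (values are invented).
view = pd.DataFrame({
    "Nome": ["Ana", "Bruno", "Carla", "Ana", "Diego"],
    "Emp": [1, 2, 5, 3, 1],
    "Hierarquia Cargo": [4, 2, 5, 6, 3],
})

# Build one mask from the same DataFrame that will be filtered,
# so the mask index always matches and no reindexing is needed.
mask = (view["Emp"] < 4) & (view["Hierarquia Cargo"] > 3)

# Filter rows and select columns in a single .loc call, then sort by name.
bd = view.loc[mask, ["Nome", "Emp"]].sort_values(by="Nome")
print(bd)
```

With the real file, the same pattern should apply by replacing the stand-in DataFrame with the result of `pd.read_excel` and listing the full set of columns in `.loc`.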

1 answer

by Paulo Marques • **3,739** points · Answer 1 · 2021-02-09T19:15:58+00:00

Yes, you can rename the duplicates. See below.

Creating a test DataFrame

>>> import pandas as pd

>>> df = pd.DataFrame({"frutas": ["banana", "goiaba", "laranja", "banana", "uva", "laranja", "banana"]})

>>> df
    frutas
0   banana
1   goiaba
2  laranja
3   banana
4      uva
5  laranja
6   banana

Renaming duplicates

>>> df["frutas"] = df.frutas.where(~df.frutas.duplicated(), df.frutas + '.')

>>> df
     frutas
0    banana
1    goiaba
2   laranja
3   banana.
4       uva
5  laranja.
6   banana.

Note that there is now one `banana` and two `banana.` entries. This corresponds to the case where several people share the same name.

Running the same line once more resolves the second `banana.`
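
To see why the third `banana` ends up identical to the second, it may help to look at the two pieces of the renaming line separately. This is only an illustrative sketch using the same fruit values; the variable names `is_repeat` and `renamed` are not part of the answer.

```python
import pandas as pd

frutas = pd.Series(["banana", "goiaba", "laranja", "banana", "uva", "laranja", "banana"])

# duplicated() is False for the first occurrence of a value and True for every later one.
is_repeat = frutas.duplicated()
print(is_repeat.tolist())
# [False, False, False, True, False, True, True]

# where() keeps the original value where the condition is True
# and takes the replacement (value + ".") everywhere else.
renamed = frutas.where(~is_repeat, frutas + ".")
print(renamed.tolist())
# ['banana', 'goiaba', 'laranja', 'banana.', 'uva', 'laranja.', 'banana.']
```

Every repeat gets exactly one dot, so a value that appears three times still leaves two identical entries after a single pass.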

>>> df["frutas"] = df.frutas.where(~df.frutas.duplicated(), df.frutas + '.')

>>> df
     frutas
0    banana
1    goiaba
2   laranja
3   banana.
4       uva
5  laranja.
6  banana..
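
If a name can repeat many times, re-running the line once per extra repetition gets tedious. A possible one-pass alternative, which is not part of the answer above, is to number each occurrence with `groupby(...).cumcount()` and append that many dots. This is a sketch that assumes none of the original values already ends in a dot.

```python
import pandas as pd

df = pd.DataFrame({"frutas": ["banana", "goiaba", "laranja", "banana", "uva", "laranja", "banana"]})

# cumcount() numbers the occurrences inside each group of equal values:
# 0 for the first, 1 for the second, 2 for the third, and so on.
occurrence = df.groupby("frutas").cumcount()

# Append one "." per earlier occurrence of the same value.
df["frutas"] = df["frutas"] + occurrence.map(lambda n: "." * n)
print(df["frutas"].tolist())
# ['banana', 'goiaba', 'laranja', 'banana.', 'uva', 'laranja.', 'banana..']
```

The result matches the table above after the second run, but it is produced in a single step regardless of how many times a value repeats.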
