Dataframe - Pandas. Creating new column after comparison between columns

Asked

Viewed 14,334 times

0

I have the following df

inserir a descrição da imagem aqui

I need to add the code of the CODIGO column in the cod_city column in cases where the value of the ds_city column equals the LOCALITY column.

I’ve tried that way, but it didn’t roll:

def acha_codigo(registro):
    if registro['ds_cidade'] == registro['LOCALIDADE']:
        return registro['CODIGO']
    else:
        return 'UNKNOWN'

Expected result:

inserir a descrição da imagem aqui

  • You say you need to "add the COLUMN code to the cod_city column," can you explain that further? I didn’t see a column called COLUMN. :-) It would be interesting to print the df and put the result instead of the image. It is easier to help.

  • Oops, my mistake. And the CODE column. This image is a print of df.head()

  • So... Always put the print result (contrl-c/control-v) instead of the image, so those who try to help can take advantage to do tests.

  • It would be interesting too, that you put as you want it to be at the end.

  • Ah, got it @Sidon. I’ll do it.

  • @Alexciuffa I think this related, but I don’t think it’s duplicate.

  • Now that the problem has become clearer, I really saw that it is not duplicate.

Show 3 more comments

1 answer

2

TL;DR

In the image of your example, there is no conicity between the columns LOCALIDADEand ds_cidade, so I created a DataFrame where one of the rows has these conicised columns, then I create a new column (nova_coluna), iteration on df to check if there are coincidences in the values of the cited columns, if yes, change the value of the nova_coluna for the value of the column CODIGO.

import pandas as pd

data = [(5200050, 'ABADIA DE GOIAS', 'GO', 'NAN', 'ABAETETUBA'), 
        (3100104, 'ABADIA DOS DOURADOS', 'MG', 'NAN', 'ABAETETUBA'),
        (5200100, 'ABADIANA', 'GO', 'NAN', 'ABADIANA'),
        (3100203, 'ABAETE', 'MG', 'NAN', 'ABAETETUBA'),
        (1500107, 'ABETETUBA', 'GO', 'NAN', 'ABAETETUBA')]


df = pd.DataFrame(data, columns = ['CODIGO', 'LOCALIDADE', 'UF', 'cod_cidade', 'ds_cidade'])

print(df)

Exit:

    CODIGO           LOCALIDADE  UF cod_cidade   ds_cidade
0  5200050      ABADIA DE GOIAS  GO        NAN  ABAETETUBA
1  3100104  ABADIA DOS DOURADOS  MG        NAN  ABAETETUBA
2  5200100             ABADIANA  GO        NAN    ABADIANA
3  3100203               ABAETE  MG        NAN  ABAETETUBA
4  1500107            ABETETUBA  GO        NAN  ABAETETUBA

Creating a new column in the DataFrame.

# Criando nova coluna
df['nova_coluna'] = 'NaN'

Iterating:

for index, row in df.iterrows():
    if row['LOCALIDADE'] == row['ds_cidade']:
        df.loc[index,'nova_coluna'] =  str(df.loc[index,'CODIGO'])  

# Apresentando o novo df:
print(df)

Exit, to the df modified:

    CODIGO           LOCALIDADE  UF cod_cidade   ds_cidade nova_coluna
0  5200050      ABADIA DE GOIAS  GO        NAN  ABAETETUBA         NaN
1  3100104  ABADIA DOS DOURADOS  MG        NAN  ABAETETUBA         NaN
2  5200100             ABADIANA  GO        NAN    ABADIANA     5200100
3  3100203               ABAETE  MG        NAN  ABAETETUBA         NaN
4  1500107            ABETETUBA  GO        NAN  ABAETETUBA         NaN

See working on repl.it.

  • Thanks for the answer, Sidon! But I believe I expressed myself in the wrong way. I added to the post an illustration of the expected result.

  • Still confused, the expected result that you put has nothing to do with the original, apart from the fact that you did not put the title of the columns in the expected result. I insist that you avoid the images here (get much less answers this way), try to create a df (my answer teaches you how to) fictional both for what you get qto for what you expect.

  • Okay, I made a new version by adding a new column (nova_coluna) in the df, just change her name to your context.

  • As soon as I can test, put a return here!

Browser other questions tagged

You are not signed in. Login or sign up in order to post.