How to fill a column of a DF Pandas using, as a comparison, a specific column between this and another DF?

Question

How to fill a column of a DF Pandas using, as a comparison, a specific column between this and another DF?

Asked 4 years, 8 months ago

Viewed 312 times

0

I have a Pandas dataframe, in which I need to add a new column called codprojeto. To do this, I created this new column and inserted zeros, so that it has the type int64, as follows:

df['codprojeto'] = 0

This new column should be fed with the codprojeto present in another dataframe. This column is of the type int64, reason I took the previous step.

As a comparison, I’m using the column CNPJ, with values in both Dfs. If equal, you must fill in df['codprojeto'] the value of df2['codprojeto'].

Attempt 1:

for i in range(len(df['CNPJ'])):
    for j in range(len(df2['CNPJ'])):
        if df2.loc[j, 'CNPJ'] == df.loc[i, 'CNPJ']:
            df.loc[i, 'codprojeto'] ==  df2.loc[j, 'codprojeto']

Returns:

Erro: KeyError: 4

Attempt 2:

for index, row in df.iterrows():
for index2, row2 in df2.iterrows():
    if str(row['CNPJ']) == row2['CNPJ']:
        df.loc[index,'codprojeto'] = df2.loc[index2,'codprojeto']

Returns: eternal execution; pending processing.

Data set (sample):

DF:

CNPJ,DATA,codprojeto
00000000000123,2020-12-02 00:00:00 UTC,0
99900000000123,2020-12-02 00:00:00 UTC,0
00000000000123,2020-12-02 00:00:00 UTC,0
00000000000123,2020-12-02 00:00:00 UTC,0
00000000000145,2020-12-02 00:00:00 UTC,0
00000000000123,2020-12-02 00:00:00 UTC,0
00000000000167,2020-12-02 00:00:00 UTC,0
00000000000167,2020-12-02 00:00:00 UTC,0
00000000000167,2020-12-02 00:00:00 UTC,0
00000000000167,2020-12-02 00:00:00 UTC,0
00000000000101,2020-12-02 00:00:00 UTC,0
00000000000122,2020-12-02 00:00:00 UTC,0
00000000000144,2020-12-02 00:00:00 UTC,0
00000000000123,2020-12-02 00:00:00 UTC,0
00000000000155,2020-12-02 00:00:00 UTC,0
00000000000155,2020-12-02 00:00:00 UTC,0
00000000000155,2020-12-02 00:00:00 UTC,0
00000000000166,2020-12-02 00:00:00 UTC,0
99900000000123,2020-12-02 00:00:00 UTC,0
99900000000123,2020-12-02 00:00:00 UTC,0

DF2:

"codcliente";"nome";"CNPJ";"codprojeto"
1;"CLIENTE 1";"00000000000123";1234
2;"CLIENTE 1";"00000000000145";5678
3;"CLIENTE 1";"00000000000167";9012
4;"CLIENTE 1";"00000000000189";3456
5;"CLIENTE 1";"00000000000101";7890
6;"CLIENTE 1";"00000000000122";11
7;"CLIENTE 1";"00000000000133";22
8;"CLIENTE 1";"00000000000144";33
9;"CLIENTE 9";"00000000000155";44
10;"CLIENTE 10";"00000000000166";55

The original DF and DF2 have respectively 635939 and 1054 lines.

For cases that there is no CNPJ compatible, leave as 0 the codprojeto.

How can I fix this?

Marlos, good afternoon! Can you make the data set available? Hug!

– lmonferrari

2020/12/02 at 15:02
Without the data it is difficult to answer, but I tell you that 1) This solution is extremely slow, use np.where instead of nested loops O(n 2); 2) On the last line the correct is = and not ==; 3) The error is possibly occurring pq 4 is not in the index of df or df2

– Lucas

2020/12/02 at 15:15
Includes the dataset in the question. About the '==' in the last line, I have already made the correction. About Keyerror 4, it’s strange, because I’m not looking for value 4, but comparing between 2 separate values.

– marloswn

2020/12/02 at 16:36
A tip to always get objective answers that meet you is to disbonibilize samples of data that contemplate all the problems you are going through. With the data you disbonibilizou you received 2 different responses that apitaram 2 different errors. There is a unique topic showing how to create a minimum, complete and verifiable example

– Terry

2020/12/03 at 14:40
Easy, thanks for the tip, Terry.

– marloswn

2020/12/04 at 11:43

2 answers

2

My suggestion is to make a LEFT JOIN using the function merge(). And for it to work properly it is not necessary to create the column codproject with zeros in Dataframe 'DF'

# aqui vou excluir a coluna 'codprojeto' de df, 
# mas acredito que no seu projeto seja necessário 
# apenas comentar a linha onde ela é criada
df = df.drop(columns= ['codprojeto'])

df = df.merge(df2[['CNPJ', 'codprojeto']], on = 'CNPJ', how = 'left').fillna(0)
#para não ter valores duplicados use
#df = df.merge(df2[['CNPJ', 'codprojeto']].drop_duplicates('CNPJ'), on = 'CNPJ', how = 'left').fillna(0)

#para converter a nova coluna para int
df['codprojeto'] = df['codprojeto'].astype('int64')

#saida:
    CNPJ    DATA                    codprojeto
0   123     2020-12-02 00:00:00 UTC 1234.0
1   999...  2020-12-02 00:00:00 UTC 0.0
2   123     2020-12-02 00:00:00 UTC 1234.0
3   123     2020-12-02 00:00:00 UTC 1234.0
4   145     2020-12-02 00:00:00 UTC 5678.0
5   123     2020-12-02 00:00:00 UTC 1234.0
6   167     2020-12-02 00:00:00 UTC 9012.0
7   167     2020-12-02 00:00:00 UTC 9012.0
8   167     2020-12-02 00:00:00 UTC 9012.0
9   167     2020-12-02 00:00:00 UTC 9012.0
10  101     2020-12-02 00:00:00 UTC 7890.0
11  122     2020-12-02 00:00:00 UTC 11.0
12  144     2020-12-02 00:00:00 UTC 33.0
13  123     2020-12-02 00:00:00 UTC 1234.0
14  155     2020-12-02 00:00:00 UTC 44.0
15  155     2020-12-02 00:00:00 UTC 44.0
16  155     2020-12-02 00:00:00 UTC 44.0
17  166     2020-12-02 00:00:00 UTC 55.0
18  999...  2020-12-02 00:00:00 UTC 0.0
19  999...  2020-12-02 00:00:00 UTC 0.0

Same question, Terry: column codprojeto in df gets zeroed.

– marloswn

2020/12/02 at 23:30
I believe that the CNPJ column type is not the same in both dataframes. Try to convert this column to number using the command : df['CNPJ'] = df['CNPJ'].astype('int64'). Do the same in df2 and then merge

– Terry

2020/12/03 at 01:50
Terry, then the error happens: Valueerror: invalid literal for int() with base 10: '00.000.000'

– marloswn

2020/12/03 at 11:54
Apparently in one of its Dataframes the CNPJ field has points, which is a problem to convert to the numeric type. Search or create a new question in the OS-PT on how to turn numbers with points to int64 into pandas, giving examples and how to end up

– Terry

2020/12/03 at 14:35
Terry, it worked! They had three test lines with '00.000.000'. When I removed them, it spun. However, it stored as a float (including in your example above). Do you want to convert to int? In addition, 14,773 more lines have been added to the dataframe, while waiting to keep 636,042 records.

– marloswn

2020/12/04 at 11:53
if there is no null value to convert to int, this way: df['codprojeto'] = df['codprojeto'].astype('int64')

– Terry

2020/12/04 at 11:58
About increasing the number of lines, enters what I commented on the @imonferrari response, I suggested using the drop_duplicates along with the mapand you said you couldn’t use it because each CNPJ had unique information inside DF2

– Terry

2020/12/04 at 12:00
I edited my answer with commands showing how not to have Duplicates data and how to convert int to column codprojeto

– Terry

2020/12/04 at 12:02
I didn’t understand how to use the drop inside the map. Could you explain to me, please?

– marloswn

2020/12/04 at 12:07
I put an example in my answer that avoids creating extra line, just take a look at the update

– Terry

2020/12/04 at 12:10
1

Terry, it worked! That’s exactly what I needed. Thank you so much!!!

– marloswn

2020/12/04 at 13:59

Show 6 more comments

Browser other questions tagged python pandas

You are not signed in. Login or sign up in order to post.

by lmonferrari • **3,550** points · Answer 1 · 2020-12-02T17:26:58+00:00

Importing the pandas

import pandas as pd

Loading the test files

df = pd.read_csv('./df.csv')
df2 = pd.read_csv('./df2.csv', sep = ';')

df

    CNPJ        DATA              codprojeto
0   123  2020-12-02 00:00:00 UTC      0
1   123  2020-12-02 00:00:00 UTC      0
2   123  2020-12-02 00:00:00 UTC      0
3   123  2020-12-02 00:00:00 UTC      0
4   145  2020-12-02 00:00:00 UTC      0
5   123  2020-12-02 00:00:00 UTC      0
6   167  2020-12-02 00:00:00 UTC      0
7   167  2020-12-02 00:00:00 UTC      0
8   167  2020-12-02 00:00:00 UTC      0
9   167  2020-12-02 00:00:00 UTC      0
10  101  2020-12-02 00:00:00 UTC      0
11  122  2020-12-02 00:00:00 UTC      0
12  144  2020-12-02 00:00:00 UTC      0
13  123  2020-12-02 00:00:00 UTC      0
14  155  2020-12-02 00:00:00 UTC      0
15  155  2020-12-02 00:00:00 UTC      0
16  155  2020-12-02 00:00:00 UTC      0
17  166  2020-12-02 00:00:00 UTC      0
18  123  2020-12-02 00:00:00 UTC      0
19  123  2020-12-02 00:00:00 UTC      0

df2

    codcliente  nome        CNPJ    codprojeto
0   1           CLIENTE 1   123     1234
1   2           CLIENTE 1   145     5678
2   3           CLIENTE 1   167     9012
3   4           CLIENTE 1   189     3456
4   5           CLIENTE 1   101     7890
5   6           CLIENTE 1   122       11
6   7           CLIENTE 1   133       22
7   8           CLIENTE 1   144       33
8   9           CLIENTE 9   155       44
9   10          CLIENTE 10  166       55

Mapping the df Series to assign values to the match in df2

df['codprojeto'] = df['CNPJ'].map(df2.set_index('CNPJ')['codprojeto']).fillna(0)

Printing

df

Exit:

    CNPJ             DATA           codprojeto
0   123   2020-12-02 00:00:00 UTC   1234
1   123   2020-12-02 00:00:00 UTC   1234
2   123   2020-12-02 00:00:00 UTC   1234
3   123   2020-12-02 00:00:00 UTC   1234
4   145   2020-12-02 00:00:00 UTC   5678
5   123   2020-12-02 00:00:00 UTC   1234
6   167   2020-12-02 00:00:00 UTC   9012
7   167   2020-12-02 00:00:00 UTC   9012
8   167   2020-12-02 00:00:00 UTC   9012
9   167   2020-12-02 00:00:00 UTC   9012
10  101   2020-12-02 00:00:00 UTC   7890
11  122   2020-12-02 00:00:00 UTC     11
...