How to fill a column of a pandas DataFrame by matching on a specific column shared with another DataFrame?

Viewed 312 times

0

I have a pandas DataFrame in which I need to add a new column called codprojeto. I created this new column filled with zeros so that it has dtype int64, as follows:

df['codprojeto'] = 0

This new column should be fed with the codprojeto present in another dataframe. That column is of type int64, which is why I took the previous step.

As the match key I am using the column CNPJ, which has values in both DataFrames. When they are equal, df['codprojeto'] should receive the value of df2['codprojeto'].

Attempt 1:

for i in range(len(df['CNPJ'])):
    for j in range(len(df2['CNPJ'])):
        if df2.loc[j, 'CNPJ'] == df.loc[i, 'CNPJ']:
            df.loc[i, 'codprojeto'] ==  df2.loc[j, 'codprojeto']

Returns:

Error: KeyError: 4

Attempt 2:

for index, row in df.iterrows():
    for index2, row2 in df2.iterrows():
        if str(row['CNPJ']) == row2['CNPJ']:
            df.loc[index, 'codprojeto'] = df2.loc[index2, 'codprojeto']

Returns: it runs forever; the processing never finishes.

Data set (sample):

DF:

CNPJ,DATA,codprojeto
00000000000123,2020-12-02 00:00:00 UTC,0
99900000000123,2020-12-02 00:00:00 UTC,0
00000000000123,2020-12-02 00:00:00 UTC,0
00000000000123,2020-12-02 00:00:00 UTC,0
00000000000145,2020-12-02 00:00:00 UTC,0
00000000000123,2020-12-02 00:00:00 UTC,0
00000000000167,2020-12-02 00:00:00 UTC,0
00000000000167,2020-12-02 00:00:00 UTC,0
00000000000167,2020-12-02 00:00:00 UTC,0
00000000000167,2020-12-02 00:00:00 UTC,0
00000000000101,2020-12-02 00:00:00 UTC,0
00000000000122,2020-12-02 00:00:00 UTC,0
00000000000144,2020-12-02 00:00:00 UTC,0
00000000000123,2020-12-02 00:00:00 UTC,0
00000000000155,2020-12-02 00:00:00 UTC,0
00000000000155,2020-12-02 00:00:00 UTC,0
00000000000155,2020-12-02 00:00:00 UTC,0
00000000000166,2020-12-02 00:00:00 UTC,0
99900000000123,2020-12-02 00:00:00 UTC,0
99900000000123,2020-12-02 00:00:00 UTC,0

DF2:

"codcliente";"nome";"CNPJ";"codprojeto"
1;"CLIENTE 1";"00000000000123";1234
2;"CLIENTE 1";"00000000000145";5678
3;"CLIENTE 1";"00000000000167";9012
4;"CLIENTE 1";"00000000000189";3456
5;"CLIENTE 1";"00000000000101";7890
6;"CLIENTE 1";"00000000000122";11
7;"CLIENTE 1";"00000000000133";22
8;"CLIENTE 1";"00000000000144";33
9;"CLIENTE 9";"00000000000155";44
10;"CLIENTE 10";"00000000000166";55

The original DF and DF2 have 635,939 and 1,054 rows, respectively.

For rows where there is no matching CNPJ, codprojeto should stay 0.

How can I fix this?

  • Marlos, good afternoon! Can you share the data set? Cheers!

  • Without the data it is difficult to answer, but I can tell you that: 1) this solution is extremely slow, since the nested loops are O(n²); use np.where instead; 2) on the last line the correct operator is = and not ==; 3) the error probably occurs because 4 is not in the index of df or df2

  • I included the dataset in the question. About the '==' on the last line, I have already made the correction. About the KeyError: 4, it is strange, because I am not looking up the value 4, only comparing values between the two DataFrames.

  • A tip for always getting objective answers that meet your needs: provide data samples that cover every problem you are running into. With the data you provided, you received 2 different answers that flagged 2 different errors. There is a dedicated topic showing how to create a minimal, complete and verifiable example

  • No problem, thanks for the tip, Terry.

2 answers

2


My suggestion is to do a LEFT JOIN using the merge() function. For it to work properly, it is not necessary to create the codprojeto column filled with zeros in the DataFrame df:

# here I drop the 'codprojeto' column from df,
# but in your project it should be enough
# to just comment out the line where the column is created
df = df.drop(columns= ['codprojeto'])

df = df.merge(df2[['CNPJ', 'codprojeto']], on = 'CNPJ', how = 'left').fillna(0)
# to avoid duplicated rows, use
# df = df.merge(df2[['CNPJ', 'codprojeto']].drop_duplicates('CNPJ'), on = 'CNPJ', how = 'left').fillna(0)

# to convert the new column to int
df['codprojeto'] = df['codprojeto'].astype('int64')

# output:
    CNPJ    DATA                    codprojeto
0   123     2020-12-02 00:00:00 UTC 1234.0
1   999...  2020-12-02 00:00:00 UTC 0.0
2   123     2020-12-02 00:00:00 UTC 1234.0
3   123     2020-12-02 00:00:00 UTC 1234.0
4   145     2020-12-02 00:00:00 UTC 5678.0
5   123     2020-12-02 00:00:00 UTC 1234.0
6   167     2020-12-02 00:00:00 UTC 9012.0
7   167     2020-12-02 00:00:00 UTC 9012.0
8   167     2020-12-02 00:00:00 UTC 9012.0
9   167     2020-12-02 00:00:00 UTC 9012.0
10  101     2020-12-02 00:00:00 UTC 7890.0
11  122     2020-12-02 00:00:00 UTC 11.0
12  144     2020-12-02 00:00:00 UTC 33.0
13  123     2020-12-02 00:00:00 UTC 1234.0
14  155     2020-12-02 00:00:00 UTC 44.0
15  155     2020-12-02 00:00:00 UTC 44.0
16  155     2020-12-02 00:00:00 UTC 44.0
17  166     2020-12-02 00:00:00 UTC 55.0
18  999...  2020-12-02 00:00:00 UTC 0.0
19  999...  2020-12-02 00:00:00 UTC 0.0
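If the merged column comes back filled only with zeros, the most likely cause is that the CNPJ columns have different dtypes in the two frames (int64 in one, strings with punctuation in the other), as discussed in the comments below. A minimal sketch of one way to normalize both columns before merging; the stray value '00.000.000' comes from the comments, and all other values here are made up for illustration:

```python
import pandas as pd

# Hypothetical frames reproducing the dtype mismatch:
# CNPJ is int64 in df, but a string (sometimes with dots) in df2.
df = pd.DataFrame({'CNPJ': [123, 145, 999]})
df2 = pd.DataFrame({'CNPJ': ['00000000000123', '00.000.000', '00000000000145'],
                    'codprojeto': [1234, 0, 5678]})

def normalize_cnpj(s):
    # Strip every non-digit character, then coerce to a numeric type;
    # malformed entries like '00.000.000' are reduced to their digits.
    digits = s.astype(str).str.replace(r'\D', '', regex=True)
    return pd.to_numeric(digits, errors='coerce').astype('Int64')

df['CNPJ'] = normalize_cnpj(df['CNPJ'])
df2['CNPJ'] = normalize_cnpj(df2['CNPJ'])

# With both key columns in the same dtype, the LEFT JOIN matches again.
out = df.merge(df2.drop_duplicates('CNPJ'), on='CNPJ', how='left')
out['codprojeto'] = out['codprojeto'].fillna(0).astype('int64')
```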
  • Same question, Terry: the codprojeto column in df ends up all zeros.

  • I believe the CNPJ column type is not the same in both dataframes. Try converting the column to a number with the command: df['CNPJ'] = df['CNPJ'].astype('int64'). Do the same in df2 and then merge

  • Terry, then this error happens: ValueError: invalid literal for int() with base 10: '00.000.000'

  • Apparently, in one of your DataFrames the CNPJ field contains dots, which is a problem when converting to a numeric type. Search, or create a new question on SO-PT, about how to convert numbers with dots to int64 in pandas, giving examples and the expected result

  • Terry, it worked! There were three test rows with '00.000.000'. When I removed them, it ran. However, it stored the column as float (as in your example above). How do I convert it to int? Also, 14,773 extra rows were added to the dataframe, when I expected to keep 636,042 records.

  • If there is no null value, convert to int this way: df['codprojeto'] = df['codprojeto'].astype('int64')

  • About the increase in the number of rows: this goes back to what I commented on @imonferrari's answer. I suggested using drop_duplicates together with map, and you said you couldn't use it because each CNPJ had unique information inside DF2

  • I edited my answer with commands showing how to avoid duplicated data and how to convert the codprojeto column to int

  • I didn't understand how to use the drop inside the map. Could you explain it to me, please?

  • I put an example in my answer that avoids creating extra rows; just take a look at the update

  • Terry, it worked! That's exactly what I needed. Thank you so much!!!


1

Importing pandas

import pandas as pd

Loading the test files

df = pd.read_csv('./df.csv')
df2 = pd.read_csv('./df2.csv', sep = ';')

df

    CNPJ        DATA              codprojeto
0   123  2020-12-02 00:00:00 UTC      0
1   123  2020-12-02 00:00:00 UTC      0
2   123  2020-12-02 00:00:00 UTC      0
3   123  2020-12-02 00:00:00 UTC      0
4   145  2020-12-02 00:00:00 UTC      0
5   123  2020-12-02 00:00:00 UTC      0
6   167  2020-12-02 00:00:00 UTC      0
7   167  2020-12-02 00:00:00 UTC      0
8   167  2020-12-02 00:00:00 UTC      0
9   167  2020-12-02 00:00:00 UTC      0
10  101  2020-12-02 00:00:00 UTC      0
11  122  2020-12-02 00:00:00 UTC      0
12  144  2020-12-02 00:00:00 UTC      0
13  123  2020-12-02 00:00:00 UTC      0
14  155  2020-12-02 00:00:00 UTC      0
15  155  2020-12-02 00:00:00 UTC      0
16  155  2020-12-02 00:00:00 UTC      0
17  166  2020-12-02 00:00:00 UTC      0
18  123  2020-12-02 00:00:00 UTC      0
19  123  2020-12-02 00:00:00 UTC      0

df2

    codcliente  nome        CNPJ    codprojeto
0   1           CLIENTE 1   123     1234
1   2           CLIENTE 1   145     5678
2   3           CLIENTE 1   167     9012
3   4           CLIENTE 1   189     3456
4   5           CLIENTE 1   101     7890
5   6           CLIENTE 1   122       11
6   7           CLIENTE 1   133       22
7   8           CLIENTE 1   144       33
8   9           CLIENTE 9   155       44
9   10          CLIENTE 10  166       55

Mapping df's CNPJ Series against df2 to assign the codprojeto values where they match

df['codprojeto'] = df['CNPJ'].map(df2.set_index('CNPJ')['codprojeto']).fillna(0)

Printing

df

Output:

    CNPJ             DATA           codprojeto
0   123   2020-12-02 00:00:00 UTC   1234
1   123   2020-12-02 00:00:00 UTC   1234
2   123   2020-12-02 00:00:00 UTC   1234
3   123   2020-12-02 00:00:00 UTC   1234
4   145   2020-12-02 00:00:00 UTC   5678
5   123   2020-12-02 00:00:00 UTC   1234
6   167   2020-12-02 00:00:00 UTC   9012
7   167   2020-12-02 00:00:00 UTC   9012
8   167   2020-12-02 00:00:00 UTC   9012
9   167   2020-12-02 00:00:00 UTC   9012
10  101   2020-12-02 00:00:00 UTC   7890
11  122   2020-12-02 00:00:00 UTC     11
...
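Note that set_index('CNPJ') only works here because the CNPJ values in df2 are unique; with duplicated keys, map raises InvalidIndexError (exactly what happens in the comments below). A minimal sketch of the drop_duplicates workaround suggested in the comments, with made-up values for illustration:

```python
import pandas as pd

df = pd.DataFrame({'CNPJ': [123, 123, 145, 999]})
# df2 with a duplicated CNPJ, which would break a plain set_index + map
df2 = pd.DataFrame({'CNPJ': [123, 123, 145],
                    'codprojeto': [1234, 1234, 5678]})

# Keep the first row per CNPJ so the lookup index is unique,
# then map each CNPJ in df to its codprojeto; misses become 0.
lookup = df2.drop_duplicates('CNPJ').set_index('CNPJ')['codprojeto']
df['codprojeto'] = df['CNPJ'].map(lookup).fillna(0).astype('int64')
```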
  • Congratulations on the solution. Can you explain what the map function is doing in this case? It is not obvious to me

  • Here it is, Imonferrari: df['codprojeto'] = df2['CNPJ_EMPRESA'].map(df.set_index('CNPJ_EMPRESA')['codprojeto']).fillna(0) Returns the following: InvalidIndexError: Reindexing only valid with uniquely valued Index objects

  • It ran, but filled the df codprojeto column with 0.0 instead of inserting the df2 codprojeto values.

  • It is as if it finds nothing and fills the entire column with 0 instead of the value in df2['codprojeto'].

  • So it fills the codprojeto column with zeros... codprojeto 0.0 0.0 0.0 0.0 0.0

  • Yes, I see it works for you, but here it keeps filling in zeros. Could it be because CNPJ is int64 in df and object in df2, so the mapping misses? That is, in one dataset it is a number, in the other a string.

  • That is a possibility; here they were already converted to int64 by pandas itself. You can test it by running df2['CNPJ'] = df2['CNPJ'].astype('int64') after loading the data frame.

  • ValueError: invalid literal for int() with base 10: '00.000.000'

  • Send me both of your datasets. Host them on Google Drive and paste the link here.

  • Imon, it is internal data, I can't post it publicly. Is there any way to contact you privately?

  • I sent an e-mail.

  • From the error message, it seems to me that using df.drop_duplicates('CNPJ_EMPRESA') before setting the column as the index (inside the .map) would resolve it; can you test?

  • Terry, I can't drop the CNPJs, because each row has unique information, no matter how many times the CNPJ repeats.

