Cross-reference two dataframes with different numbers of rows

Question

Asked 3 years, 7 months ago

Viewed 150 times

-1

I have the following dataset

df.head()

The column Grau Instrução (education level) holds a numeric code for each education level. The codes and their category names are listed in the table below:

Grau_Instr_Bibl = {'Categoria': ['Analfabeto', 'Até 5ª Incompleto', '5ª Completo Fundamental', '6ª a 9ª Fundamental', 'Fundamental Completo', 'Médio Incompleto', 'Médio Completo', 'Superior Incompleto', 'Superior Completo', 'MESTRADO', 'DOUTORADO', 'IGNORADO'],
                   'Valores na fonte': ['1','2','3','4','5','6','7','8','9','10','11','-1']
                  }
Grau_Instr_Bibli = pd.DataFrame(data=Grau_Instr_Bibl)
Grau_Instr_Bibli

I would like to look up each code from 'Valores na fonte' and put the matching category name in a new column.

I tried a for loop, without success:

for i in range(0, df.shape[0]):
   if df.at[i, "Grau Instrução"] == Grau_Instr_Bibli['Valores na fonte']:
     df['GrauNovo'] = df.append(df['Grau Instrução'].loc[[i]])

Is there an easier way to cross-reference two dataframes with different numbers of rows, or at least a way to do it with a single for loop?

2 answers

2

You can solve this with the .map() method, passing it a Series whose index holds the codes and whose values hold the category names:

dataset_original = pd.DataFrame({'Grau de instrucao': ['1','2','3','4','5','6','7','8','9','10','11','-1']})

s = Grau_Instr_Bibli.set_index('Valores na fonte')['Categoria']

dataset_original['GrauNovo'] = dataset_original['Grau de instrucao'].map(s)

# output of dataset_original
    Grau de instrucao   GrauNovo
0   1                   Analfabeto
1   2                   Até 5ª Incompleto
2   3                   5ª Completo Fundamental
3   4                   6ª a 9ª Fundamental
4   5                   Fundamental Completo
5   6                   Médio Incompleto
6   7                   Médio Completo
7   8                   Superior Incompleto
8   9                   Superior Completo
9   10                  MESTRADO
10  11                  DOUTORADO
11  -1                  IGNORADO
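The example above has the same number of rows in both dataframes, but .map() does not depend on that. A minimal sketch, assuming a shortened lookup table and a hypothetical df with repeated codes plus one code ('99') that is not in the table:

```python
import pandas as pd

# Shortened lookup table (the question has 12 rows; 3 are enough here).
Grau_Instr_Bibli = pd.DataFrame({
    'Categoria': ['Analfabeto', 'Médio Completo', 'IGNORADO'],
    'Valores na fonte': ['1', '7', '-1'],
})

# Hypothetical data: more rows than the lookup table, repeated codes,
# and one code ('99') that does not exist in the lookup table.
df = pd.DataFrame({'Grau Instrução': ['7', '1', '7', '-1', '7', '99']})

# Series indexed by code, with the category name as the value.
s = Grau_Instr_Bibli.set_index('Valores na fonte')['Categoria']

# Each row is looked up independently, so the row counts need not match.
df['GrauNovo'] = df['Grau Instrução'].map(s)

print(df)
```

Codes that have no entry in the lookup table come back as NaN, which makes them easy to find afterwards with df['GrauNovo'].isna().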

1

Thank you so much for your help! I had to adapt it to fit my problem, though: the column df['Grau Instrução'] of the original dataframe was of type int64, so I had to change s as follows: s = {1: 'Analfabeto', 2: 'Até 5ª Incompleto', 3: '5ª Completo Fundamental', 4: '6ª a 9ª Fundamental', 5: 'Fundamental Completo', 6: 'Médio Incompleto', 7: 'Médio Completo', 8: 'Superior Incompleto', 9: 'Superior Completo', 10: 'MESTRADO', 11: 'DOUTORADO', -1: 'IGNORADO'}. With the keys stored as int in s, the method worked. Thanks, everyone!

– Luis Henrique Batista

2020/11/17 at 01:08
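The type mismatch described in this comment can also be handled without retyping the dictionary by hand. A minimal sketch, assuming a shortened lookup table and a hypothetical int64 column, that converts the index of s to int before mapping:

```python
import pandas as pd

# Shortened lookup table with the codes stored as strings, as in the question.
Grau_Instr_Bibli = pd.DataFrame({
    'Categoria': ['Analfabeto', 'Superior Completo', 'IGNORADO'],
    'Valores na fonte': ['1', '9', '-1'],
})

# Hypothetical data where the codes are integers (int64), as in the comment.
df = pd.DataFrame({'Grau Instrução': [9, 1, -1, 9]})

# String keys never equal integer values, so nothing matches here.
s_str = Grau_Instr_Bibli.set_index('Valores na fonte')['Categoria']
unmatched = df['Grau Instrução'].map(s_str)

# Convert the keys to int so they have the same type as the column.
s = s_str.copy()
s.index = s.index.astype(int)
df['GrauNovo'] = df['Grau Instrução'].map(s)

print(df)
```

Converting the column instead, with df['Grau Instrução'].astype(str), should work as well; which side to convert depends on which type you want to keep in the data.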


by lmonferrari • **3,550** points · Answer 1 · 2020-11-13T10:47:26+00:00

You can use replace with a dictionary.

Importing the package:

import pandas as pd

Creating the first data frame:

Grau_Instr_Bibl = {'Categoria': ['Analfabeto', 'Até 5ª Incompleto', '5ª Completo Fundamental', '6ª a 9ª Fundamental', 'Fundamental Completo', 'Médio Incompleto', 'Médio Completo', 'Superior Incompleto', 'Superior Completo', 'MESTRADO', 'DOUTORADO', 'IGNORADO'],
                   'Valores na fonte': ['1','2','3','4','5','6','7','8','9','10','11','-1']
                  }
Grau_Instr_Bibli = pd.DataFrame(data=Grau_Instr_Bibl)

Simulating your source file:

fonte = pd.DataFrame({'Valores na fonte': ['1','2','3','4','5','6','7','8','9','10','11','-1']})

Here we build a dictionary from Grau_Instr_Bibli, with the codes as keys and the category names as values, and use it to replace the values in fonte:

fonte.replace(Grau_Instr_Bibli.set_index('Valores na fonte').to_dict()['Categoria'], inplace = True)

Input:

    Valores na fonte
0   1
1   2
2   3
3   4
4   5
5   6
6   7
7   8
8   9
9   10
10  11
11  -1

Output:

    Valores na fonte
0   Analfabeto
1   Até 5ª Incompleto
2   5ª Completo Fundamental
3   6ª a 9ª Fundamental
4   Fundamental Completo
5   Médio Incompleto
6   Médio Completo
7   Superior Incompleto
8   Superior Completo
9   MESTRADO
10  DOUTORADO
11  IGNORADO

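Creating a new column instead (starting again from the original fonte, before the in-place replacement):

fonte['Nova coluna'] = fonte['Valores na fonte'].replace(Grau_Instr_Bibli.set_index('Valores na fonte').to_dict()['Categoria'])

Note that I removed inplace = True, so replace returns a new object, which is assigned to the new column while the original column is left untouched.

Output: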

    Valores na fonte    Nova coluna
0           1           Analfabeto
1           2           Até 5ª Incompleto
2           3           5ª Completo Fundamental
3           4           6ª a 9ª Fundamental
4           5           Fundamental Completo
5           6           Médio Incompleto
6           7           Médio Completo
7           8           Superior Incompleto
8           9           Superior Completo
9          10           MESTRADO
10         11           DOUTORADO
11         -1           IGNORADO
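One practical difference between replace and the .map() approach from the other answer shows up when a code has no entry in the dictionary. A minimal sketch, assuming a shortened dictionary and a hypothetical code '99' that is not in it:

```python
import pandas as pd

# Shortened dictionary: code -> category name.
mapping = {'1': 'Analfabeto', '7': 'Médio Completo'}

# Hypothetical codes; '99' has no entry in the dictionary.
codes = pd.Series(['1', '7', '99'])

# replace leaves values without a match unchanged.
replaced = codes.replace(mapping)

# map turns values without a match into NaN.
mapped = codes.map(mapping)

print(replaced.tolist())
print(mapped.tolist())
```

With replace, an unknown code silently stays in the column next to the category names; with map, it becomes a missing value that can be detected with isna().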

Complete code:

import pandas as pd

Grau_Instr_Bibl = {'Categoria': ['Analfabeto', 'Até 5ª Incompleto', '5ª Completo Fundamental', '6ª a 9ª Fundamental', 'Fundamental Completo', 'Médio Incompleto', 'Médio Completo', 'Superior Incompleto', 'Superior Completo', 'MESTRADO', 'DOUTORADO', 'IGNORADO'],
                   'Valores na fonte': ['1','2','3','4','5','6','7','8','9','10','11','-1']}

Grau_Instr_Bibli = pd.DataFrame(data=Grau_Instr_Bibl)

fonte = pd.DataFrame({'Valores na fonte': ['1','2','3','4','5','6','7','8','9','10','11','-1']})

fonte['Nova coluna'] = fonte['Valores na fonte'].replace(Grau_Instr_Bibli.set_index('Valores na fonte').to_dict()['Categoria'])
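As an alternative that neither answer uses, the same cross-reference can be written as a left join with DataFrame.merge. A minimal sketch, assuming a shortened lookup table and a hypothetical fonte with repeated codes:

```python
import pandas as pd

# Shortened lookup table: one row per code.
Grau_Instr_Bibli = pd.DataFrame({
    'Categoria': ['Analfabeto', 'Médio Completo', 'IGNORADO'],
    'Valores na fonte': ['1', '7', '-1'],
})

# Hypothetical source data with repeated codes.
fonte = pd.DataFrame({'Valores na fonte': ['7', '1', '7', '-1']})

# how='left' keeps every row of fonte, in its original order, and
# attaches the matching 'Categoria' from the lookup table.
resultado = fonte.merge(Grau_Instr_Bibli, on='Valores na fonte', how='left')

print(resultado)
```

merge is convenient when the lookup table has several columns to bring over at once; for a single column, map or replace is usually shorter.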