How to generate an ID to identify a student in a database with enrollments of several years?

Asked

Viewed 76 times

1

I have the following basis of dice that contains enrollments of an institution that is developing a research, in the future to make a mapping of students I need an id that identifies each student, where with this id I can identify the student the following year, since I have data from 2009 to 2018.

To create this id I am using the columns with the attributes of students that does not change regardless of the year, which in case are TP_COR_RACA, TP_SEXO, NU_ANO_NASCIMENTO, NU_MES_NASCIMENTO and NU_DIA_NASCIMENTO, the idea is that from these attributes I create a id that repeats whenever that combination is the same.

I tried to do it this way:

def getKey(row):
    # retorna um str com os valores das colunas que são imutaveis independente do ano
    # que será usado como chave para gerar o Id.
    key = f"{row['TP_COR_RACA']}{row['TP_SEXO']}{row['NU_ANO_NASCIMENTO']}\
    {row['NU_MES_NASCIMENTO']}{row['NU_DIA_NASCIMENTO']}"

    return key

def getDicId(df):
    # retorna um dicionário em que os index são as chaves, sem repetir.
    dic_id = {}
    for index, row in df.iterrows():
        key = getKey(row)
        dic_id[key] = 'null'

    # cria um id para cada chave do dicionario.
    idx = 1
    for i in dic_id:
        dic_id[i] = idx
        idx += 1

    return dic_id
 

def getId(df):
    df.loc[df_concat.TP_SEXO == 2, 'TP_SEXO'] = 0
    dic_id = getDicId(df)
    df['Id'] = 'null'
    # Obtem uma chave de cada matricula e atribui sue id usando  o 
    # dicionario que tem o index igual sua chave.
    for index, row in df.iterrows():
        key = getKey(row)
        df.loc[index, 'Id'] = dic_id[key]
        
    return df

But it didn’t work very well, the same id was generated for different students even using these attributes as key contrary to what I expected, the result can be checked here, if anyone knows where the mistake is or how else I can do it I appreciate it.

  • Instead of creating a ID it might be more convenient to use the CPF of the student, since, by its nature, it is non-transferable and unalterable. This way, you can access the student’s data, many years after the same has left the educational institution.

  • It would be a good but does not have the CPF in the database

  • Important you [Dit] your question by providing a [mcve] problem that makes it clear what went wrong in the solution attempt. To better enjoy the site, understand and avoid closures and negativities worth reading the Stack Overflow Survival Guide in English.

  • @Bacco gave an improved question

1 answer

1


Imports:

import pandas as pd

Loading the data:

df = pd.read_csv('https://raw.githubusercontent.com/Renatolopo62/IFNMG-Januaria/master/dados/dados_padronizados_matriculas_januaria_2009_2018.csv')

Converting all data frame to string type to be able to manipulate:

df1 = df.applymap(str)

Selecting which columns will generate the new ID column:

df1['ID']  = df1[['TP_COR_RACA', 
                 'TP_SEXO', 
                 'NU_ANO_NASCIMENTO',
                 'NU_MES_NASCIMENTO',
                 'NU_DIA_NASCIMENTO']].agg(''.join, axis=1) 

Showing the ID column

df1['ID']

Exit:

0        311990613
1         60198987
2        601990130
3        611990119
4        101981816
           ...    
10982    321999117
10983     11199996
10984    321991615
10985    312000426
10986    321998121
  • 1

    Very well! I managed to generate this way

Browser other questions tagged

You are not signed in. Login or sign up in order to post.