How to create class correctly with pandas by applying methods?

Question

Asked 4 years, 10 months ago

Viewed 174 times

I have a 'dados.csv' file with the data below:

turma,nome,code,motivo,atividade,trofeus,data
9º Ano Fundamental A,Maria Joana,9X4YK,Realizar atividade Astromaker,Lição A,3,21/02/2020 11:44:11
9º Ano Fundamental A,Maria Joana,9X4YK,Realizar atividade Astromaker,Lição B,3,28/02/2020 11:46:49
9º Ano Fundamental A,Maria Joana,9X4YK,Realizar atividade Astromaker,Lição B,3,06/03/2020 11:31:43
9º Ano Fundamental A,José Antonio,9XV62,Realizar atividade Astromaker,Lição B,3,14/02/2020 12:28:55

I created a class to read the csv file:

import pandas as pd


class DataFrame(object):
    def __init__(self, name_file):
        self.name_file = name_file
        self.df = self.read_file()
        return self.df

    def read_file(self):
        try:
            self.df = pd.read_csv(self.name_file)
        except IndexError:
            print('Erro: nome de arquivo incorreto')
        return self.df

In the same class, I then added methods for filtering and grouping, for example:

    def soma_trofeus_aluno(self):
        self.df = self.df.groupby(['turma', 'nome', 'code'])['trofeus'].sum().reset_index()

    def filtro_aluno(self, df, aluno):
        self.aluno = ''
        self.df = df[df['nome'].str.contains(self.aluno)]
        return self.df

But none of the methods I tried worked. I'm trying to call them like this:

def main():
    dados = DataFrame('dados.csv')
    dados.filtro_aluno(dados, 'Maria')
    dados.exibir_df()



if __name__ == '__main__':
    main()

How can I define my methods correctly and call them?

2 answers

The error happens because the DataFrame class has an attribute named df, and in the filtro_aluno function you defined a parameter with the same name. So, if self is not used, it is ambiguous which variable you are referring to. What's more, the df argument you pass to this function is an object of the DataFrame class you created, not a pandas DataFrame, so it cannot be indexed with df['nome']. To solve the problem, the line

self.df = df[df['nome'].str.contains(self.aluno)]

should be replaced by

self.df = pd.DataFrame(self.df[self.df['nome'].str.contains(self.aluno)])

if you want to keep the df parameter in the function. Otherwise, just remove the parameter. Either way, the self. prefix must stay on references to the df attribute, because it is an instance attribute.

The complete code looks like this (I implemented the exibir_df function myself; change it according to your needs):

import pandas as pd


class DataFrame(object):
    def __init__(self, name_file):
        self.name_file = name_file
        self.df = self.read_file()

    def read_file(self):
        try:
            return pd.read_csv(self.name_file)
        except FileNotFoundError:
            # a missing file raises FileNotFoundError, not IndexError
            print('Erro: nome de arquivo incorreto')
            raise
    
    def soma_trofeus_aluno(self):
        self.df = self.df.groupby(['turma', 'nome', 'code'])['trofeus'].sum().reset_index()

    def filtro_aluno(self, aluno):
        # use the argument; an empty pattern would match every row
        self.aluno = aluno
        self.df = pd.DataFrame(self.df[self.df['nome'].str.contains(self.aluno)])
        return self.df
    
    def exibir_df(self):
        print(self.df)


def main():
    dados = DataFrame('dados.csv')
    dados.filtro_aluno('Maria')
    dados.exibir_df()



if __name__ == '__main__':
    main()
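
One detail in the question's filtro_aluno is worth checking separately: it assigns an empty string to self.aluno instead of the aluno argument, and an empty pattern matches every name. The sketch below illustrates this, assuming the question's sample rows are loaded from an in-memory string rather than from 'dados.csv' (the CSV, todos and so_maria names are only for illustration):

```python
import io

import pandas as pd

# Sample rows copied from the question, read from memory instead of disk.
CSV = """turma,nome,code,motivo,atividade,trofeus,data
9º Ano Fundamental A,Maria Joana,9X4YK,Realizar atividade Astromaker,Lição A,3,21/02/2020 11:44:11
9º Ano Fundamental A,Maria Joana,9X4YK,Realizar atividade Astromaker,Lição B,3,28/02/2020 11:46:49
9º Ano Fundamental A,Maria Joana,9X4YK,Realizar atividade Astromaker,Lição B,3,06/03/2020 11:31:43
9º Ano Fundamental A,José Antonio,9XV62,Realizar atividade Astromaker,Lição B,3,14/02/2020 12:28:55
"""
df = pd.read_csv(io.StringIO(CSV))

# An empty pattern matches every string, so no row is filtered out.
todos = df[df['nome'].str.contains('')]

# Passing the real search term keeps only the matching rows.
so_maria = df[df['nome'].str.contains('Maria')]

print(len(df), len(todos), len(so_maria))  # 4 4 3
```

So even after the call itself works, the filter only has an effect once the method uses the value it receives.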

Great, it helped a lot. I just removed the 'dados' argument from 'dados.filtro_aluno(dados, 'Maria')'. Awesome!

– Rony Deikson Santana

2020/09/09 at 19:57

by hkotsubo • **55,826** points · Answer 1 · 2020-09-09T16:42:07+00:00

__init__ is the constructor: it initializes the instance being created. You should not return anything from it; doing so makes no sense (Python raises a TypeError if __init__ returns anything other than None). It should only do what is necessary to create a valid instance (for example, read the file and create the dataframe):

def __init__(self, name_file):
    self.df = pd.read_csv(name_file)

That's it (I removed the read_file method, which seemed redundant to me). If something goes wrong, the exception is raised and the instance is not created (after all, does it make sense to create the DataFrame if the file is invalid?).
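
Both points can be checked with a small sketch. The class names ComRetorno and Dados and the file name 'arquivo_que_nao_existe.csv' are hypothetical, chosen only for this illustration; the file is assumed not to exist:

```python
import pandas as pd


class ComRetorno:
    def __init__(self):
        # Returning anything other than None from __init__ is an error.
        return 42


class Dados:
    def __init__(self, name_file):
        # If read_csv raises, __init__ never finishes and no instance exists.
        self.df = pd.read_csv(name_file)


try:
    ComRetorno()
    erro_retorno = None
except TypeError as exc:
    erro_retorno = exc

try:
    dados = Dados('arquivo_que_nao_existe.csv')  # hypothetical missing file
except FileNotFoundError:
    dados = None

print(type(erro_retorno).__name__)  # TypeError
print(dados)                        # None
```

Letting the exception propagate means the caller decides how to handle a bad file name, instead of ending up with a half-built object.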

I didn't store the file name in self.name_file because you don't use it for anything afterwards. If the file name is not part of the class's state (and is only used to read the file), do not store it in a field. Here it does not seem necessary, because what matters is the file's data (self.df). Don't keep what will no longer be used.

Once the self.df field is created, you can use it in the other methods. But there is one detail. You are doing this:

self.df = do something...

That is, every time you call soma_trofeus_aluno or filtro_aluno, you change the class's dataframe. Does it make sense to change the original data? I guess not: suppose you loaded the whole file (i.e., self.df contains all the data). Then you call filtro_aluno and filter by "Maria". After self.df = filter result, the self.df field holds only Maria's records (the other data is lost). So in this case it would make sense to return the result in another DataFrame, something like this:

import pandas as pd

class DataFrame:
    # the constructor receives the file name, or an already-built dataframe
    def __init__(self, name_file=None, df=None):
        if name_file is not None:
            self.df = pd.read_csv(name_file)
        elif df is not None:
            self.df = df
        else:
            raise ValueError('provide name_file or df')

    def soma_trofeus_aluno(self):
        # returns another DataFrame
        return DataFrame(df=self.df.groupby(['turma', 'nome', 'code'])['trofeus'].sum().reset_index())

    def filtro_aluno(self, aluno):
        # returns another DataFrame
        return DataFrame(df=self.df[self.df['nome'].str.contains(aluno)])

    def exibir_df(self):
        print(self.df)

def main():
    dados = DataFrame(name_file='dados.csv')
    # filtro is another DataFrame with only Maria's data
    filtro = dados.filtro_aluno('Maria')
    filtro.exibir_df()
    # soma is another DataFrame with only the students' totals
    soma = dados.soma_trofeus_aluno()
    soma.exibir_df()
    # the original DataFrame still has all the records
    dados.exibir_df()
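
To make the difference concrete, here is a minimal sketch comparing the two styles. It assumes a cut-down, in-memory copy of the question's rows, and the class names Mutavel and Imutavel are hypothetical, introduced only for this comparison:

```python
import io

import pandas as pd

# Cut-down copy of the question's rows (only the columns needed here).
CSV = """turma,nome,code,trofeus
9º Ano Fundamental A,Maria Joana,9X4YK,3
9º Ano Fundamental A,Maria Joana,9X4YK,3
9º Ano Fundamental A,Maria Joana,9X4YK,3
9º Ano Fundamental A,José Antonio,9XV62,3
"""


class Mutavel:
    """Overwrites self.df on every filter, as in the question."""

    def __init__(self, df):
        self.df = df

    def filtro_aluno(self, aluno):
        self.df = self.df[self.df['nome'].str.contains(aluno)]
        return self.df


class Imutavel:
    """Returns a new wrapper and leaves self.df untouched."""

    def __init__(self, df):
        self.df = df

    def filtro_aluno(self, aluno):
        return Imutavel(self.df[self.df['nome'].str.contains(aluno)])


original = pd.read_csv(io.StringIO(CSV))

m = Mutavel(original)
m.filtro_aluno('Maria')
m.filtro_aluno('José')  # searches only Maria's rows, so nothing is left

i = Imutavel(original)
maria = i.filtro_aluno('Maria')
jose = i.filtro_aluno('José')

print(len(m.df), len(maria.df), len(jose.df), len(i.df))  # 0 3 1 4
```

With the mutating version, the second filter runs on what the first one left behind; with the non-mutating version, each filter starts from the full data.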

That said, do you really need a class for all of this? Why not create the dataframes separately and have functions to manipulate them?

import pandas as pd

def read_df(name_file):
    return pd.read_csv(name_file)

def soma_trofeus_aluno(df):
    # returns another dataframe
    return df.groupby(['turma', 'nome', 'code'])['trofeus'].sum().reset_index()

def filtro_aluno(df, aluno):
    # returns another dataframe
    return df[df['nome'].str.contains(aluno)]

def main():
    dados = read_df(name_file='dados.csv')
    # filtro is another dataframe with only Maria's data
    filtro = filtro_aluno(dados, 'Maria')
    print(filtro)
    # soma is another dataframe with only the students' totals
    soma = soma_trofeus_aluno(dados)
    print(soma)
    # the original dataframe still has all the records
    print(dados)
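
Plain functions like these also compose well. As one possible sketch (not part of the original answer), pandas' DataFrame.pipe can chain them so that the filter and the sum read left to right. The in-memory CSV is a cut-down copy of the question's rows, used here so the example does not depend on 'dados.csv':

```python
import io

import pandas as pd

# Cut-down copy of the question's rows (only the columns needed here).
CSV = """turma,nome,code,trofeus
9º Ano Fundamental A,Maria Joana,9X4YK,3
9º Ano Fundamental A,Maria Joana,9X4YK,3
9º Ano Fundamental A,Maria Joana,9X4YK,3
9º Ano Fundamental A,José Antonio,9XV62,3
"""


def soma_trofeus_aluno(df):
    # returns another dataframe with one row per student
    return df.groupby(['turma', 'nome', 'code'])['trofeus'].sum().reset_index()


def filtro_aluno(df, aluno):
    # returns another dataframe with only the matching students
    return df[df['nome'].str.contains(aluno)]


dados = pd.read_csv(io.StringIO(CSV))

# pipe passes the dataframe as the first argument of each function.
resultado = dados.pipe(filtro_aluno, 'Maria').pipe(soma_trofeus_aluno)

print(resultado)
print(len(dados))  # the original dataframe still has all 4 rows
```

Here resultado holds a single row for Maria Joana with her trophies summed, while dados is unchanged.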