Optimization with DataFrames (Pandas)


I need to compare two .csv files for inconsistencies. The boleto.txt file contains information about the boletos (payment slips) issued by a company; this file has 500,000 lines. The lancamentos.txt file contains information about the items included in each boleto; this file has 1.2 million lines.

I need to verify that the sum of the item values in lancamentos.txt matches the value of the corresponding boleto in boleto.txt.

I wrote the following code in Python:

import numpy as np
import pandas as pd

#reading the boleto.txt file
boleto = pd.read_csv("C:/boleto.txt", header=None, delimiter='\t', encoding='ISO-8859-1')
boleto.columns = ['sigla','unidade','numero','dt_vencimento','valor','dt_pagamento','valor_pago','dt_credito','reembolso','status','abonado','inativo','nao_contabil','pessoa','custas']

#reading the lancamentos.txt file
lancamentos = pd.read_csv("C:/lancamentos.txt", header=None, delimiter='\t', encoding='ISO-8859-1')
lancamentos.columns = ['sigla','unidade','numero','dt_vencimento','dt_credito','valor','valor_pago','destinacao','desconto','conta','desconhecido']

#iterating boleto by boleto
for row in boleto.index:
    #defining the matching conditions
    cond1 = (lancamentos['sigla'] == boleto.iloc[row]["sigla"])
    cond2 = (lancamentos['unidade'] == boleto.iloc[row]["unidade"])
    cond3 = (lancamentos['numero'] == boleto.iloc[row]["numero"])
    cond4 = (lancamentos['dt_vencimento'] == boleto.iloc[row]["dt_vencimento"])

    #filtering to get the lancamentos belonging to this boleto
    resultado = lancamentos.loc[(cond1 & cond2 & cond3 & cond4)]

    #if the boleto value differs from the sum of the item values
    if boleto.iloc[row]["valor"] != resultado['valor'].sum().round(decimals=2):
        print('diferente')
        print(boleto.iloc[row]["valor"])
        print(resultado['valor'].sum())

This code does work, but it takes forever to run. Is there a way to rewrite it to make it faster?

  • Gugax, good evening! Please provide test data and an example of the expected result, so people can help you more easily. Hugs!

2 answers



Based on this part of the question:

I need to verify that the sum of the item values in lancamentos.txt matches the value of the corresponding boleto in boleto.txt.

Use groupby and sum() together.

Example:

>>> import pandas as pd

>>> df = pd.DataFrame({"A": [1.1, 1.2, 1.3, 1.4], "B": ["banana", "abacaxi", "abacaxi", "banana"]})

>>> df
     A        B
0  1.1   banana
1  1.2  abacaxi
2  1.3  abacaxi
3  1.4   banana

>>> df.groupby(["B"]).sum()
           A
B
abacaxi  2.5
banana   2.5

You can group by more than one column by adding the column names to the list that groupby takes as a parameter.

The result of groupby + sum() can be assigned to a new DataFrame.

Then you can work with that instead.

I hope this helps.
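Applied to the question's data, a minimal sketch might look like this (the column names come from the question's code; the sample values are made up, and rounding to 2 decimals mirrors the original comparison):

```python
import pandas as pd

# Tiny frames shaped like the question's files (values are hypothetical)
boleto = pd.DataFrame({
    'sigla': ['A', 'A'], 'unidade': [1, 1], 'numero': [100, 101],
    'dt_vencimento': ['2020-01-01', '2020-02-01'],
    'valor': [30.0, 10.0]})
lancamentos = pd.DataFrame({
    'sigla': ['A', 'A', 'A'], 'unidade': [1, 1, 1],
    'numero': [100, 100, 101],
    'dt_vencimento': ['2020-01-01', '2020-01-01', '2020-02-01'],
    'valor': [10.0, 20.0, 15.0]})

chaves = ['sigla', 'unidade', 'numero', 'dt_vencimento']

# Sum the item values per boleto in a single pass...
somas = (lancamentos.groupby(chaves, as_index=False)['valor'].sum()
                    .rename(columns={'valor': 'valor_itens'}))

# ...then join the sums back onto the boletos and compare, instead of looping
comparacao = boleto.merge(somas, on=chaves, how='left')
diferentes = comparacao[
    comparacao['valor'].round(2) != comparacao['valor_itens'].round(2)]
print(diferentes)
```

With a left merge, a boleto that has no items at all gets NaN in valor_itens and is also flagged as different, which seems like the desired behavior here.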


Dude, there's a really good but little-known library for big DataFrames called Dask.

import dask
import dask.dataframe as dd

Most pandas operations have a Dask equivalent.

Check out the time difference:

(Benchmark screenshots comparing pandas and Dask execution times; the original answer credits an external image source.)

Briefly, it operates in parallel: each operation is split into N smaller NumPy/pandas partitions. I confess I don't know exactly how it works internally, as I have only recently started reading its documentation. Still, I believe it solves the execution-time problem you are facing.
