Create Dataframe from lists with different sizes, keeping the data matching

Asked

Viewed 59 times

0

I have lists of different lengths with data extracted from a PDF.

Each value and code is part (is inside) of a respective note:

cliente = ['12345','15432']
nota = ['21/0576750-3', '20/063859-1']
codigo = [86, 6052,5031,1038,1025]
valor = [100.45, 200.34, 450.10, 150.58, 42.30]

I want to turn this data into a dataframe like this:

inserir a descrição da imagem aqui

Note: The challenge is to be able to create relationships one for many in order to keep the correspondence between the data of the lines.

I tried the following code, but the table loses the reference of the line data.

tabela = list(zip(cliente,nota,codigo,valor)
df = pd.DataFrame(tabela, columns = ['Cliente', 'Nota', 'Código','Valor'])
df.to_excel('d:/nome.xlsx', index=False)

Could anyone help? Thank you!

  • Can put the sample pdf?

  • 1

    Imonferrari, man, I managed to solve with the solution I posted below. Thank you so much for your attention!

2 answers

2


Guys, I was able to solve the issue of creating Dataframe without losing the indexing of column information!

The solution was to use the Collections - namedtuple library

This function allows the creation of a tuple with items that work like a dictionary, maintaining the relationship between the keys and the extracted values.

Follow the code below:

import pandas as pd
import re
from collections import namedtuple


formulario = namedtuple('Formulário', 'cliente nota codigo valor')


padraoCliente = re.compile('[0-9]{2}.[0-9]{3}.[0-9]{3}\/[0-9]{4}-[0-9]{2}')
padraoNota = re.compile('Número da nota')
padraoCodigoValor = re.compile('([0-9]{4})\s(\.?[0-9]*\.?[0-9]*\.?[0-9]*,[0-9]{2})')

extracao = ocr.getTexto('arquivo.pdf')

linha_items = []

for linha in extracao.split('\n'):
    if padraoCliente.match(linha):
        cliente = linha.split()
        cliente = ' '.join(cliente)

    if padraoNota.match(linha):
        nota = linha.split()
        

    linha = padraoCodigoValor.search(linha)
    if linha:
        codigo = linha.group(1)
        valor = linha.group(2)
        
        linha_items.append(formulario(cliente nota codigo valor))

df = df = pd.DataFrame(linha_items, columns=['Número do cliente','Número da Nota', 'Código do produto', 'Valor do Produto'])

I learned the procedure by watching this video: Youtube: Use Python to Extract Invoice Lines from a semistructured PDF AP Report

1

you can do this with the following code:

pd.DataFrame({'Cliente': pd.Series(cliente), 'Nota': pd.Series(nota), 'Código': pd.Series(codigo), 'Valor': pd.Series(valor)})

Exit:

  Cliente          Nota  Código   Valor
0   12345  21/0576750-3      86  100.45
1   15432   20/063859-1    6052  200.34
2     NaN           NaN    5031  450.10
3     NaN           NaN    1038  150.58
4     NaN           NaN    1025   42.30

The only "problem" is that you will have to see how it will handle the repetition of the client code and the note.

  • 1

    Evilmaax, thanks for the idea! Very interesting to use Series dictionaries in the value fields! My problem is really keeping the client’s link and note with the respective codes. It would be something like a one-for-many relationship that I still haven’t been able to do in extraction.

  • 1

    Man, I thought of something pretty rudimentary for that. You could scroll through the Note column and when you found any repetition in the current row with the previous one, you would add the client value of the previous line in the current one. You can do it with for and if. Surely there is a more Pythonica solution, but it will solve the problem

  • 1

    Man, I managed to solve with the solution I posted above. Thank you so much for your help!!! You gave me two great ideas to use in my next code.

  • 1

    Boooa @Re_moreira Hit, bro!

Browser other questions tagged

You are not signed in. Login or sign up in order to post.