Find values in blocks that extend across multiple lines

I have a mirror file (arquivo espelho) that contains mirrored copies of coupons, and I wrote an algorithm to separate the coupons:

import re

txt = open("arqEspelho.txt", 'r+').read()

x = re.finditer(r".*COTIA\s*C*", txt)

z = re.finditer(r"OPR.*", txt)

espelhos = list(zip(x, z))

for espelho in espelhos:
    txt_espelho = txt[espelho[0].span()[0]: espelho[1].span()[1] + 1]

Here txt_espelho would be the block of text for a single coupon. However, I need to find a coupon by its total value and its extrato (statement) number. The extrato line looks exactly like this:

<N>                     Extrato No. 042356</N>

I tried to find it this way:

re.findall(r'Extrato No. 042356', txt_espelho)

But it still returns random coupons. There is also the total line, which looks like this:

TOTAL R$                                                717,30

And I tried to find it like this:

re.findall(r"TOTAL\sR.*717,30", txt_espelho)

But it behaves as if that value did not exist in txt_espelho, even though it does. The variable x defines the start (início) of each coupon, and the variable z defines its end (fim).

But sometimes it finds the word "COTIA" elsewhere in a coupon, so I tried to define the start like this:

x = re.finditer(r".*CENTRO", txt)

This refers to the word CENTRO, which appears next to COTIA in the header, to avoid false matches, but it still does not find the word.

In short, I need to look up a coupon by its total value and extrato number within those coupon blocks.


File example:

                       COTIA CENTRO
                        ATACADAO S.A.
                      PROF JOSE BARRETO
--------------------------------------------------------------
CNPJ 75.315.333/0059-25
IE 278.157.726.114
IM ISENTO
--------------------------------------------------------------
<N>                     Extrato No. 042353</N>
<N>                CUPOM FISCAL ELETRÔNICO - SAT</N>
--------------------------------------------------------------
#|COD|DESC|QTD|UN|VL UN R$|(VL TR R$)*|VL ITEM R$
--------------------------------------------------------------
001 00025200 SAB.DOVE BRANCO         1X90G 
     1 UND9 X 2,39 (0,94)                                 2,39
002 00025200 SAB.DOVE BRANCO         1X90G 
     1 UND9 X 2,39 (0,94)                                 2,39
003 00061325 VINHO SANGUE BOI        1X1LT 
     1 UND9 X 11,90 (4,17)                               11,90
004 00004940 OLEO SOJA CONCORDIA   1X900ML 
     1 UND9 X 6,59 (0,74)                                 6,59
005 00048794 ESC.DENTAL DENTRAT A   1X1UND 
     1 UND9 X 2,25 (0,96)                                 2,25
006 00064376 HASTES FLEXIVEIS COT 1X150UND 
     1 UND9 X 2,29 (0,51)                                 2,29
007 00058824 SHAMP.DARLING         1X350ML 
     1 UND9 X 4,90 (2,20)                                 4,90
008 00004274 CAFE CABOCLO ALMOF.    1X500G 
     1 PCT8 X 6,88 (0,77)                                 6,88
009 00050143 REFR.PO TANG            1X25G 
     2 UND9 X 0,99 (0,48)                                 1,98
010 00050140 REFR.PO TANG            1X25G 
     1 UND9 X 0,99 (0,24)                                 0,99
011 00050144 REFR.PO TANG            1X25G 
     1 UND9 X 0,99 (0,24)                                 0,99
012 00050145 REFR.PO TANG            1X25G 
     1 UND9 X 0,99 (0,24)                                 0,99


Total bruto de Itens                                     44,54
<N>TOTAL R$                                                 44,54</N>


Dinheiro                                                 50,00

Troco R$                                                  5,46

--------------------------------------------------------------
OBSERVACOES DO CONTRIBUINTE


*Valor aproximado dos tributos do item
Valor aproximado dos tributos deste cupom
(conforme Lei Fed.12.741/2012) R$                        12,45
Vlr.Aprox.Tributos: Federal R$5,41 (12,15%) 
Vlr.Aprox.Tributos: Estadual R$7,04 (15,81%) 
Fonte: IBPT.
--------------------------------------------------------------
<N>                      SAT No. 000895390</N>
                    01/02/2021 - 07:30:50


<N>               3521 0275 3153 3300 5925 5900 </N>
<N>                  0895 3900 4235 3985 7588 </N>
CFe35210275315333005925590008953900423539857588|20210201073050|44.54||C7Vv8aEzrn2pHy6l0ldI4qbrPdlBfJ35VojuEoVTpblLmYZkerh7fZzDbLFIDdGc3ztTxM8ZaTWaF6veC3uKdy2A5a2ZiXhQZH62i3wn5PDR8rIPFGTJFmabD7GhkwOcNkPTGQKo/CW3x3ArPPjidX5cSl7O3yjWVKabD53OrAcn8HTLJsGSt/2hnHlf+RHcB9JEYC2IFQkOB9oWqlxifZUx+oUGtd3cTiad5ACHjexHh68xeYe+MPgNOECmaPPhaWq8/kgVAUZLsBnOdf3xefnU3+0NwBKujhZx3IsWbHRUFR1OPA8YFgcDGGwhJ0RtIw7wRi+dDtNNY31Cwa2o4A==
--------------------------------------------------------------
        TPLinux AT.14.c00X-18.06 - Unisys Brasil Ltda
--------------------------------------------------------------
EPSON TM-T20    VERSAO:10.02 ES    PDV:020    LJ:059
OPR:0008108Leidiana M                      01/02/2021 07:30:50
carro 



                        COTIA CENTRO
                        ATACADAO S.A.
                      PROF JOSE BARRETO
--------------------------------------------------------------
CNPJ 75.315.333/0059-25
IE 278.157.726.114
IM ISENTO
--------------------------------------------------------------
<N>                     Extrato No. 042354</N>
<N>                CUPOM FISCAL ELETRÔNICO - SAT</N>
--------------------------------------------------------------
#|COD|DESC|QTD|UN|VL UN R$|(VL TR R$)*|VL ITEM R$
--------------------------------------------------------------
001 00036102 COXA FGO SEARA IQF      1X1Kg 
     1 PCT9 X 8,98 (1,45)                                 8,98
002 00017122 ESC.D.SORRISO STD      1X1UND 
     1 UND9 X 2,35 (0,60)                                 2,35
003 00012075 CR.D.COLGATE MPA        1X90G 
     1 TBO9 X 2,14 (0,35)                                 2,14
004 00057464 SABAO PDC YPE COCO     1X200G 
     1 UND9 X 2,35 (0,52)                                 2,35
005 00033822 BISC.DUCHEN CR.CRACK   1X200G 
     1 UND8 X 1,49 (0,17)                                 1,49
006 00066640 MAC.PREDILLETO COMUM   1X500G 
     1 UND9 X 1,89 (0,21)                                 1,89
007 00066640 MAC.PREDILLETO COMUM   1X500G 
     1 UND9 X 1,89 (0,21)                                 1,89
008 00061018 SAB.PROTEX              1X85G 
     1 UND9 X 2,10 (0,47)                                 2,10


Total bruto de Itens                                     23,19
<N>TOTAL R$                                                 23,19</N>


Dinheiro                                                 24,00

Troco R$                                                  0,81

--------------------------------------------------------------
OBSERVACOES DO CONTRIBUINTE


*Valor aproximado dos tributos do item
Valor aproximado dos tributos deste cupom
(conforme Lei Fed.12.741/2012) R$                         3,98
Vlr.Aprox.Tributos: Federal R$1,20 (5,17%) 
Vlr.Aprox.Tributos: Estadual R$2,78 (11,99%) 
Fonte: IBPT.
--------------------------------------------------------------
<N>                      SAT No. 000895390</N>
                    01/02/2021 - 07:41:52


<N>               3521 0275 3153 3300 5925 5900 </N>
<N>                  0895 3900 4235 4685 9540 </N>
CFe35210275315333005925590008953900423546859540|20210201074152|23.19||cLWbQszXKX3f89kmOQ3k1Te72502OiJPKuqgKyehwiApqxvS3Jli1JVnjiCgXHHPZChueR8XXB61nurhmBJ3f/55Mphd4pq0UVjdMR61n+9/UPzq1MYCz2I3M2+/UTWw3aa3rzy+Y/bpUa6wOBn60+F/clO8jNc22AVzASdl62NH/rI2883hQfCxy53r/ECRtxDjujNHMjZcLbsBwAFeXbFANZcA3c7PECxcBxBtDP8lfuPqSPjjEbGL587KWEApILMLZwviqXUvYB6dkj5OC6iEwPpTuhRyZnHaZfSZzB3+n1qwCZVOKu8uKqHuw3gtcE3k6Q98tZ0O827+TbTMjQ==
--------------------------------------------------------------
        TPLinux AT.14.c00X-18.06 - Unisys Brasil Ltda
--------------------------------------------------------------
EPSON TM-T20    VERSAO:10.02 ES    PDV:020    LJ:059
OPR:0008108Leidiana M                      01/02/2021 07:41:52
carro 



                        COTIA CENTRO
                        ATACADAO S.A.
                      PROF JOSE BARRETO
--------------------------------------------------------------
CNPJ 75.315.333/0059-25
IE 278.157.726.114
IM ISENTO
--------------------------------------------------------------
<N>                     Extrato No. 042355</N>
<N>                CUPOM FISCAL ELETRÔNICO - SAT</N>
--------------------------------------------------------------
#|COD|DESC|QTD|UN|VL UN R$|(VL TR R$)*|VL ITEM R$
--------------------------------------------------------------
001 00009580 COXA/SOB.FGO MR FGO     1X1Kg 
 0,794 KG9  X 6,90 (0,89)                                 5,48
002 00009580 COXA/SOB.FGO MR FGO     1X1Kg 
 0,710 KG9  X 6,90 (0,79)                                 4,90
003 00009580 COXA/SOB.FGO MR FGO     1X1Kg 
 0,680 KG9  X 6,90 (0,76)                                 4,69
004 00009580 COXA/SOB.FGO MR FGO     1X1Kg 
 0,856 KG9  X 6,90 (0,96)                                 5,91
005 00009580 COXA/SOB.FGO MR FGO     1X1Kg 
 0,782 KG9  X 6,90 (0,87)                                 5,40
006 00009580 COXA/SOB.FGO MR FGO     1X1Kg 
 0,786 KG9  X 6,90 (0,88)                                 5,42
007 00009580 COXA/SOB.FGO MR FGO     1X1Kg 
 0,674 KG9  X 6,90 (0,75)                                 4,65
008 00009580 COXA/SOB.FGO MR FGO     1X1Kg 
 0,576 KG9  X 6,90 (0,64)                                 3,97
009 00009580 COXA/SOB.FGO MR FGO     1X1Kg 
 0,754 KG9  X 6,90 (0,84)                                 5,20
010 00033738 FRANGO CONFINA CONG.    1X1Kg 
 2,614 KG9  X 6,50 (2,75)                                16,99
011 00033738 FRANGO CONFINA CONG.    1X1Kg 
 2,568 KG9  X 6,50 (2,70)                                16,69
012 00033738 FRANGO CONFINA CONG.    1X1Kg 
 2,390 KG9  X 6,50 (2,52)                                15,54
013 00033738 FRANGO CONFINA CONG.    1X1Kg 
 2,564 KG9  X 6,50 (2,70)                                16,67
014 00033738 FRANGO CONFINA CONG.    1X1Kg 
 2,142 KG9  X 6,50 (2,26)                                13,92


Total bruto de Itens                                    125,43
<N>TOTAL R$                                                125,43</N>


Dinheiro                                                130,00

Troco R$                                                  4,57

--------------------------------------------------------------
OBSERVACOES DO CONTRIBUINTE


*Valor aproximado dos tributos do item
Valor aproximado dos tributos deste cupom
(conforme Lei Fed.12.741/2012) R$                        20,32
Vlr.Aprox.Tributos: Federal R$5,27 (4,20%) 
Vlr.Aprox.Tributos: Estadual R$15,05 (12,00%) 
Fonte: IBPT.
--------------------------------------------------------------
<N>                      SAT No. 000895390</N>
                    01/02/2021 - 07:44:32


<N>               3521 0275 3153 3300 5925 5900 </N>
<N>                  0895 3900 4235 5353 0841 </N>
CFe35210275315333005925590008953900423553530841|20210201074432|125.43||MMZY3pEVZjxz7vN1sCZoKgaOsMj8NqDgi3UFhuve6eSaIGstqJvtFd4Ho4jucoMxl2uJ9mTNOKzRpeuYpXOYwGJSqVzubhpNw63YmyGv8j3Yzi+HW+TXnJANrP+cPNCmCpcRYPvaxyLF/ko1JkwIUNGBN550pLsXcmCVqxXqgRR51VaspD72t4Rt8V+3ORuyJrVd07sSfnqj2jOlsYUg01M9czd7TGiddYJXC8BOR/427xYxVV1DAVKk019YXxEus3ZsKsTGDpQ4jycuTRv3DsS8OWUIVbh9Nhp5jBBijeRH7T46UyrcsJcRYfxTgS0WzhrqA3l8EBSDKdnOdeNUnQ==
--------------------------------------------------------------
        TPLinux AT.14.c00X-18.06 - Unisys Brasil Ltda
--------------------------------------------------------------
EPSON TM-T20    VERSAO:10.02 ES    PDV:020    LJ:059
OPR:0008108Leidiana M                      01/02/2021 07:44:33
carro 

1 answer



I think you're overcomplicating this.

Instead of scanning the whole string for the beginning and end of each coupon, you can simply read the file line by line and check, for each line, whether it marks the start or end of a coupon, or whether it contains the extrato number, the total value, and so on.

You can still use regex to extract the parts that matter, but the code gets simpler:

import re

re_extrato = re.compile(r'Extrato No. (\d+)')
re_total = re.compile(r'TOTAL R\$\s+(\d+,\d{2})')

cupons = [] # list of coupons
with open("arqEspelho.txt", 'r') as arq:
    for linha in arq: # for each line of the file
        linha = linha.strip() # strip leading/trailing spaces and the line break
        if linha == 'COTIA CENTRO': # start of a coupon
            # start a new coupon
            cupom = {}
        elif linha == 'carro': # end of a coupon
            cupons.append(cupom) # add it to the list of coupons
        else:
            # look for the extrato number
            m = re_extrato.search(linha)
            if m: # if found, store the extrato number in the coupon
                cupom['extrato'] = m.group(1)
            else: # otherwise, look for the total
                m = re_total.search(linha)
                if m: # if found, store the value
                    cupom['total'] = m.group(1)

From Python 3.8 onward you can use assignment expressions (the := operator), which make the code a little more concise:

# Python 3.8 and later
import re

re_extrato = re.compile(r'Extrato No. (\d+)')
re_total = re.compile(r'TOTAL R\$\s+(\d+,\d{2})')

cupons = []
with open("arqEspelho.txt", 'r') as arq:
    for linha in arq:
        linha = linha.strip() # strip leading/trailing spaces and the line break
        if linha == 'COTIA CENTRO':
            # start a new coupon
            cupom = {}
        elif m := re_extrato.search(linha): # assignment expression, Python >= 3.8 only
            cupom['extrato'] = m.group(1) # extrato found, store it in the coupon
        elif m := re_total.search(linha): # assignment expression, Python >= 3.8 only
            cupom['total'] = m.group(1) # total found, store it in the coupon
        elif linha == 'carro': # end of a coupon
            cupons.append(cupom) # add it to the list of coupons

Here each regex has a capture group (the parentheses), which I can retrieve later with the group method to get only the information I want (the digits of the extrato number, or the total value).

The result is the list cupons, in which each element is a dictionary containing the extrato number and the total. You can then use it to find coupons by whatever criteria you want, for example:
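
As a small illustration of the difference between the whole match and the capture group, here is a sketch that applies the same extrato pattern to one line copied from the example file. The variable names texto_completo and numero are only illustrative.

```python
import re

re_extrato = re.compile(r'Extrato No. (\d+)')

# One line copied from the example file
linha = '<N>                     Extrato No. 042353</N>'
m = re_extrato.search(linha)

texto_completo = m.group(0)  # everything the pattern matched
numero = m.group(1)          # only what the parentheses captured

print(texto_completo)
print(numero)
```

group(0) gives the matched text including the "Extrato No." prefix, while group(1) gives just the digits, which is why the code stores m.group(1).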

# find a coupon by its total value
for cupom in cupons:
    if cupom['total'] == '44,54':
        print(f'found, extrato={cupom["extrato"]}')

Here the total is a string, but you can convert it to a number if you want (I think that is beyond the scope of the question, but once you have the data you can manipulate it as you see fit).

To extract more data from each coupon, just add more conditions to the if/elif chain, create new expressions where needed to extract what you want, and save that data in the cupom dictionary.
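
If you do want numbers, one possible sketch is below. The helper parse_valor is a hypothetical name, not part of the code above, and it assumes the amounts use a comma as the decimal separator and optionally a dot as the thousands separator. Decimal is used instead of float to avoid rounding surprises with money. The list of coupons is typed in by hand from the three coupons in the example file.

```python
from decimal import Decimal

def parse_valor(texto):
    """Convert an amount such as '1.234,56' or '44,54' to Decimal (hypothetical helper)."""
    # Drop thousands separators, if any, then swap the decimal comma for a dot.
    return Decimal(texto.replace('.', '').replace(',', '.'))

# Values taken from the three coupons in the example file
cupons = [
    {'extrato': '042353', 'total': '44,54'},
    {'extrato': '042354', 'total': '23,19'},
    {'extrato': '042355', 'total': '125,43'},
]

# With numeric totals you can filter by range instead of exact string equality.
acima_de_40 = [c['extrato'] for c in cupons if parse_valor(c['total']) > Decimal('40')]
print(acima_de_40)
```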
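
As the number of fields grows, one alternative to a long if/elif chain is a table of patterns. The sketch below is only an illustration: the field names dinheiro and troco and their patterns are my assumptions based on the "Dinheiro" and "Troco R$" lines in the example file, and other payment methods may need other patterns.

```python
import re

# Field name -> pattern that extracts it (the last two are illustrative assumptions)
padroes = {
    'extrato': re.compile(r'Extrato No. (\d+)'),
    'total': re.compile(r'TOTAL R\$\s+(\d+,\d{2})'),
    'dinheiro': re.compile(r'Dinheiro\s+(\d+,\d{2})'),
    'troco': re.compile(r'Troco R\$\s+(\d+,\d{2})'),
}

# A few lines copied from the first coupon in the example file
linhas = [
    '<N>                     Extrato No. 042353</N>',
    '<N>TOTAL R$                                                 44,54</N>',
    'Dinheiro                                                 50,00',
    'Troco R$                                                  5,46',
]

cupom = {}
for linha in linhas:
    for campo, padrao in padroes.items():
        m = padrao.search(linha)
        if m:
            cupom[campo] = m.group(1)
            break  # at most one field per line

print(cupom)
```

Adding a new field then means adding one entry to padroes rather than another branch in the loop.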


And if you also want the full text of each coupon, just accumulate it in the same loop:

import re

re_extrato = re.compile(r'Extrato No. (\d+)')
re_total = re.compile(r'TOTAL R\$\s+(\d+,\d{2})')

cupons = []
texto_cupom = ''
with open("arqEspelho.txt", 'r') as arq:
    for linha in arq:
        texto_cupom += linha # text of the coupon
        linha_sem_espacos = linha.strip() # strip leading/trailing spaces and the line break
        if linha_sem_espacos == 'COTIA CENTRO':
            # start a new coupon
            texto_cupom = '' # start a new text
            cupom = {}
        elif m := re_extrato.search(linha_sem_espacos):
            cupom['extrato'] = m.group(1)
        elif m := re_total.search(linha_sem_espacos):
            cupom['total'] = m.group(1)
        elif linha_sem_espacos == 'carro': # end of a coupon
            cupom['texto'] = texto_cupom # save the whole text
            cupons.append(cupom) # add it to the list of coupons

That is, when a coupon starts I reset the text, and when it ends I store the accumulated text in the dictionary.
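
To tie this back to the question (fetching the block of one specific coupon), here is a hedged sketch that wraps the same loop in a function and indexes the result by extrato number. The function name ler_cupons, the guard for lines outside a coupon, and the in-memory sample are my own additions for illustration; the sample contains only a few lines copied from the example file.

```python
import io
import re

re_extrato = re.compile(r'Extrato No. (\d+)')
re_total = re.compile(r'TOTAL R\$\s+(\d+,\d{2})')

def ler_cupons(linhas):
    """Same idea as the loop above, for any iterable of lines (hypothetical helper)."""
    cupons = []
    cupom = None
    texto_cupom = ''
    for linha in linhas:
        texto_cupom += linha
        limpa = linha.strip()
        if limpa == 'COTIA CENTRO':  # start of a coupon
            texto_cupom = ''
            cupom = {}
            continue
        if cupom is None:  # ignore lines that are outside any coupon
            continue
        m = re_extrato.search(limpa)
        if m:
            cupom['extrato'] = m.group(1)
            continue
        m = re_total.search(limpa)
        if m:
            cupom['total'] = m.group(1)
        elif limpa == 'carro':  # end of a coupon
            cupom['texto'] = texto_cupom
            cupons.append(cupom)
            cupom = None
    return cupons

# Abbreviated sample: only the lines the loop cares about
amostra = io.StringIO(
    '                       COTIA CENTRO\n'
    '<N>                     Extrato No. 042353</N>\n'
    '<N>TOTAL R$                                                 44,54</N>\n'
    'carro \n'
    '                        COTIA CENTRO\n'
    '<N>                     Extrato No. 042354</N>\n'
    '<N>TOTAL R$                                                 23,19</N>\n'
    'carro \n'
)

# Index by extrato number so a specific coupon can be fetched directly
por_extrato = {c['extrato']: c for c in ler_cupons(amostra)}
achado = por_extrato['042354']
print(achado['total'])
print(achado['texto'])
```

Note that the header line itself is not part of the stored text, because the text is reset after the header has been read. If you want to keep it, assigning texto_cupom = linha at that point instead of an empty string should do it.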

  • Thank you very much, you are sensational ;D
