Check the relation of two objects in a list

I need to extract the total value of an agreement from plain text. I have hundreds of documents, each containing several figures, and I noticed that the highest value is usually also the total value of the agreement, but not always.

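import re

def ata_values(text):
    # Capture amounts such as "$ 1.234,56" (dot for thousands, comma for decimals)
    padrao = re.findall(r'\$\s*(\d{1,3}(?:\.?\d{1,3})+(?:\,\d{2})?)', text)
    # Drop the thousands separators, then turn the decimal comma into a dot
    padrao = [p.replace('.', '') for p in padrao]
    padrao = [p.replace(',', '.') for p in padrao]
    padrao = [float(p) for p in padrao]

    return padrao, max(padrao)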

It returns:

([2500.0, 833.33, 833.33, 833.34, 2500.0], 2500.0)
([1000.0, 800.0, 200.0, 1000.0], 1000.0)
([280.0, 14000.0, 21000.0], 21000.0)
([3000.0, 15000.0, 7000.0, 7000.0, 7000.0, 700.0, 700.0, 700.0, 700.0, 700.0, 700.0, 700.0, 700.0, 700.0, 700.0, 750.0, 750.0, 750.0, 1083.33, 1200.0, 1600.0, 1616.67, 140.0], 15000.0)

The first element is the list padrao with all the values found, and the second, max(padrao), is the highest value in that list. In this example the first two lines are correct, but the last two are not: in those, the second-highest value is the correct one. I noticed that most of the lists where I get this error share a pattern: the list contains the total value plus a value that corresponds to 2% of the total value.

How can I check, before taking the maximum value, whether a list contains a number X together with a number equal to 0.02*X?

for x in padrao:
    for y in padrao:
        if x == y*0.02:
            return x
        else: 
            return max(padrao)
  • In the third line the expected result would be 14000, because 280 is 2% of that total. In the fourth line the expected result would be 7000, because 140 is 2% of that total.

1 answer

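# Go from the highest value down; the total is the first value X
# for which 0.02 * X is also in the list (otherwise keep the maximum).
padrao.sort(reverse=True)
maior = padrao[0]
for x in padrao:
    if round(x * 0.02, 2) in padrao:
        maior = x
        break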
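
The 2% check can also be wrapped in a function and run against the four lists from the question. This is only a sketch: the name pick_total, the rate parameter and the half-cent tolerance tol are assumptions for illustration, not something from the question. Note that the 2% value has to be tested against each candidate total, not only against the maximum: in the third list 280 is 2% of 14000, not of 21000.

```python
def pick_total(values, rate=0.02, tol=0.005):
    """Return the highest value X for which some entry in the list is
    within `tol` of rate * X; fall back to max(values) otherwise.
    (Hypothetical helper; `rate` and `tol` are illustrative assumptions.)"""
    if not values:
        return None
    # Try candidates from highest to lowest so the largest match wins.
    for x in sorted(values, reverse=True):
        # Compare with a tolerance instead of ==, since these are floats.
        if any(abs(y - rate * x) <= tol for y in values):
            return x
    # No (X, 2% of X) pair found: keep the original behaviour.
    return max(values)


# The lists shown in the question.
samples = [
    [2500.0, 833.33, 833.33, 833.34, 2500.0],
    [1000.0, 800.0, 200.0, 1000.0],
    [280.0, 14000.0, 21000.0],
    [3000.0, 15000.0, 7000.0, 7000.0, 7000.0, 700.0, 700.0, 700.0,
     700.0, 700.0, 700.0, 700.0, 700.0, 700.0, 700.0, 750.0, 750.0,
     750.0, 1083.33, 1200.0, 1600.0, 1616.67, 140.0],
]
totals = [pick_total(s) for s in samples]
print(totals)  # [2500.0, 1000.0, 14000.0, 7000.0]
```

Scanning from the highest value down keeps the result as close as possible to the original max() behaviour, but a coincidental 2% pair elsewhere in a list would still produce a wrong total.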

But as I initially wrote in a comment: this approach is very risky, especially if the documents are free text. Isn't there something more consistent, even if it's harder to find with a single regular expression? For example, does the word "total" appear next to the number, or in a specific section of the document, or always near the end of the document? In that case you would first isolate an excerpt where the total value should appear and only then worry about extracting the number.
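
The "isolate an excerpt first" idea could look roughly like the sketch below. It assumes the keyword "total" appears shortly before the amount, which may not hold for the real documents. The function name total_near_keyword, the 80-character window and the sample sentence are all made up for illustration. The money pattern is the one from the question.

```python
import re

# Same money pattern as in the question ("$ 1.234,56" style amounts).
MONEY = re.compile(r'\$\s*(\d{1,3}(?:\.?\d{1,3})+(?:,\d{2})?)')


def to_float(raw):
    """Convert a string such as '14.000,00' to 14000.0."""
    return float(raw.replace('.', '').replace(',', '.'))


def total_near_keyword(text, keyword='total', window=80):
    """Return the first amount found within `window` characters after an
    occurrence of `keyword` (case-insensitive), or None if there is none.
    (Hypothetical helper; keyword and window size are assumptions.)"""
    for hit in re.finditer(re.escape(keyword), text, flags=re.IGNORECASE):
        # Isolate the excerpt where the total is expected to appear...
        excerpt = text[hit.end():hit.end() + window]
        # ...and only then look for a number inside it.
        match = MONEY.search(excerpt)
        if match:
            return to_float(match.group(1))
    return None


# Made-up sample text, not taken from the real documents.
sample = ("Administrative fee: R$ 280,00. Guarantee: R$ 21.000,00. "
          "Total value of the agreement: R$ 14.000,00.")
print(total_near_keyword(sample))  # 14000.0
```

When the function returns None, the 2% heuristic or max() could still be used as a fallback.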
