I need to extract the total value of an agreement from plain text. I have hundreds of documents containing several monetary figures, and I noticed that the highest figure is usually the total value of the agreement, but not always.
import re

def ata_values(text):
    # Capture amounts after "$" written as 1.234,56 (dot = thousands, comma = decimals)
    padrao = re.findall(r'\$\s*(\d{1,3}(?:\.?\d{1,3})+(?:\,\d{2})?)', text)
    # Drop the thousands dots, turn the decimal comma into a dot, and convert to float
    padrao = [float(p.replace('.', '').replace(',', '.')) for p in padrao]
    return padrao, max(padrao)
It returns:
([2500.0, 833.33, 833.33, 833.34, 2500.0], 2500.0)
([1000.0, 800.0, 200.0, 1000.0], 1000.0)
([280.0, 14000.0, 21000.0], 21000.0)
([3000.0, 15000.0, 7000.0, 7000.0, 7000.0, 700.0, 700.0, 700.0, 700.0, 700.0, 700.0, 700.0, 700.0, 700.0, 700.0, 750.0, 750.0, 750.0, 1083.33, 1200.0, 1600.0, 1616.67, 140.0], 15000.0)
The first element is the list padrao with all the values found, and the second, max(padrao), is the highest value in that list. In this example the first two lines are correct, but the last two are not: in those, the second-highest value is the correct one. I noticed that most of the lists with this error share a pattern: the list contains the total value plus another value equal to 2% of that total.
How can I check, before taking the maximum, whether a list contains a number X together with a number equal to 0.02*X?
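import math

def total_value(padrao):
    # Look for a value y whose 2% (to within half a cent) also appears in the list
    for y in sorted(padrao, reverse=True):
        for x in padrao:
            if math.isclose(x, y * 0.02, abs_tol=0.005):
                return y
    # No such pair found: fall back to the highest value
    return max(padrao)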
In the third line the expected result is 14000, since 280 corresponds to 2% of that total. In the fourth line the expected result is 7000, since 140 corresponds to 2% of that total.
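One possible way to make this check without comparing floats directly is to convert every amount to whole cents and look the 2% amount up in a set. This is only a sketch: the function name `agreement_total` and the `rate` parameter are hypothetical, and it assumes the 2% figure in the documents is rounded to the nearest cent.

```python
def agreement_total(values, rate=0.02):
    """Return the value whose `rate` share also appears in `values`,
    falling back to the plain maximum when no such pair exists."""
    # Work in integer cents so that 280.0 and 14000 * 0.02 compare exactly
    cents = {round(v * 100) for v in values}
    # A candidate total is a positive value whose 2% is also in the list
    candidates = [
        v for v in values
        if v > 0 and round(v * rate * 100) in cents
    ]
    # If several candidates exist, the largest is assumed to be the total
    return max(candidates) if candidates else max(values)


print(agreement_total([280.0, 14000.0, 21000.0]))  # 14000.0
```

Lists without a 2% pair, such as the first two in the question, still return their maximum, so the function could replace max(padrao) in ata_values.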
– stacker