Extracting the digits from a string - Python

How can I get just the digits from this string?

<SCANNER A7899739503929>

I tried using re.findall, but without success; it returns the full tags instead of just the digits:

import re

# Read the whole file; the context manager closes it afterwards
with open("P:/portal/cupons/sco/sc_op.024_10_06.txt", "r") as f:
    txt = f.read()

# finditer returns match objects, which carry the position of each match

# Locate the start of each coupon
x = re.finditer(r"CAIXA.*", txt)
# Locate the end of each coupon
z = re.finditer(r"SUBTOTAL.*", txt)

# Pair each start with the corresponding end
espelhos = list(zip(x, z))

# Check the codes found in each coupon
for espelho in espelhos:

    txt_espelho = txt[espelho[0].span()[0]: espelho[1].span()[1] + 1]

    print('===================================================================================================================================')

    codigos = re.findall(r"<SCANNER.*", txt_espelho)

    print(codigos)

This is what it returns:

'<SCANNER A7891203021106>', '<SCANNER A7891203021106>', '<SCANNER A7891203021304>', '<SCANNER A7891203021304>', '<SCANNER A7891962036984>'

How can I get just the numbers?

3 answers

4

Simply put, you can do it like this:

txt = '<SCANNER A7899739503929>'

somente_digitos = ''.join([d for d in txt if d.isdigit()])

print(somente_digitos)
7899739503929

If you have a list:

>>> lista = ['<SCANNER A7891203021106>', '<SCANNER A7891203021106>', '<SCANNER A7891203021304>', '<SCANNER A7891203021304>', '<SCANNER A7891962036984>']


>>> for item in lista:
...     print(''.join([d for d in item if d.isdigit()]))
...
7891203021106
7891203021106
7891203021304
7891203021304
7891962036984

Or use map:

>>> lista = ['<SCANNER A7891203021106>', '<SCANNER A7891203021106>', '<SCANNER A7891203021304>', '<SCANNER A7891203021304>', '<SCANNER A7891962036984>']

>>> print(list(map(lambda item: ''.join([d for d in item if d.isdigit()]), lista)))
['7891203021106', '7891203021106', '7891203021304', '7891203021304', '7891962036984']
  • Perfect, but they all come out glued together. Would I have to separate each code?

  • 1

    There is a way... I believe the variable codigos is a list of strings, right? If so, use the last example with codigos in place of lista.
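
As a minimal sketch of that suggestion, assuming codigos holds strings like the ones shown in the question (the two sample values below are copied from there), a list comprehension keeps each code as a separate list element instead of gluing them together:

```python
# Sample values copied from the question; in the real script this list
# would come from re.findall(r"<SCANNER.*", txt_espelho).
codigos = ['<SCANNER A7891203021106>', '<SCANNER A7891203021304>']

# One string of digits per original item, so the codes stay separate
somente_digitos = [''.join(d for d in item if d.isdigit()) for item in codigos]

print(somente_digitos)
# ['7891203021106', '7891203021304']
```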

4


Just change it to:

re.findall(r"<SCANNER A(\d+)", txt_espelho)

Here, \d is a shorthand for "digit", and the quantifier + means "one or more".

The parentheses, in turn, create a capture group, and when a regex has capture groups, findall returns only those groups. So the result is a list containing only the numbers that come right after SCANNER A.

Your regex didn’t work because you used .*, and since the dot matches any character (except a line break, by default), it also picks up characters that are not digits. You could even use just findall(r'\d+', ...), but that would also return numbers that do not come after SCANNER A (I don’t know whether there are any; if there aren’t, it makes no difference).
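
To make the difference between the three patterns concrete, here is a small sketch. The text in amostra is made up for illustration (including the unrelated number 42) and is not taken from the real file:

```python
import re

# Hypothetical sample text; the real coupon file may look different.
amostra = "ITEM 42 <SCANNER A7891203021106> more text\n<SCANNER A7891962036984>"

# .* matches everything up to the end of the line, not only digits
com_ponto = re.findall(r"<SCANNER.*", amostra)
print(com_ponto)
# ['<SCANNER A7891203021106> more text', '<SCANNER A7891962036984>']

# The capture group keeps only the digits right after "SCANNER A"
com_grupo = re.findall(r"<SCANNER A(\d+)", amostra)
print(com_grupo)
# ['7891203021106', '7891962036984']

# \d+ alone also picks up unrelated numbers such as 42
so_digitos = re.findall(r"\d+", amostra)
print(so_digitos)
# ['42', '7891203021106', '7891962036984']
```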

Finally, I don’t think this is the best way to process this file (with several regexes scanning the whole text over and over). It would be better to read the file line by line and process the data bit by bit as it is found, as suggested here.
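
A rough sketch of what line-by-line processing could look like follows. It assumes, based only on the regexes in the question, that a coupon starts at a line containing CAIXA and ends at a line containing SUBTOTAL; the function name codigos_por_cupom and the sample lines are hypothetical, and the real file format may require adjustments:

```python
import re

def codigos_por_cupom(linhas):
    """Yield one list of scanner codes per coupon (hypothetical helper)."""
    codigos = None  # None means we are currently outside a coupon
    for linha in linhas:
        # Assumption: a line containing CAIXA opens a coupon
        if codigos is None and "CAIXA" in linha:
            codigos = []
        if codigos is not None:
            codigos.extend(re.findall(r"<SCANNER A(\d+)", linha))
            # Assumption: a line containing SUBTOTAL closes the coupon
            if "SUBTOTAL" in linha:
                yield codigos
                codigos = None

# Made-up lines standing in for the real file
exemplo = [
    "CAIXA 001",
    "<SCANNER A7891203021106>",
    "<SCANNER A7891203021304>",
    "SUBTOTAL 10,00",
    "some line between coupons",
    "CAIXA 002",
    "<SCANNER A7891962036984>",
    "SUBTOTAL 5,00",
]

resultado = list(codigos_por_cupom(exemplo))
print(resultado)
# [['7891203021106', '7891203021304'], ['7891962036984']]
```

Because the function only iterates over lines, an open file object can be passed in place of exemplo, so the file is read once and each coupon is produced exactly once.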

  • Perfect, bro, but the output contains several duplicate coupons (link). Each block of '===' corresponds to one coupon, and each one comes out 3 times. Do you know why?

  • @Ineedcoffe Without seeing the complete file, it’s impossible to guess. I suggest you open another question with some sample coupons. My guess is that the text you pass in contains more than one coupon, I don’t know... As I said, I don’t think this is the best solution; try adapting the code from the answer given in the other question, which I believe is a better approach...

  • I’m blocked from asking questions :C, but this is the link to the file. I need to take everything from the words OPEN BOX up to SUBTOTAL, extract all the codes inside the <SCANNER> tags in between, and split all of this into blocks, but everything comes out duplicated ;C

3

The regex you are looking for is [0-9]+, which matches every part of the string made up of one or more consecutive digits:

import re

entrada = "<SCANNER A7891203021106>, <SCANNER A7891203021106>, <SCANNER A7891203021304>, <SCANNER A7891203021304>, <SCANNER A7891962036984>"
codigos = re.findall(r"[0-9]+", entrada)

print(codigos)

Output:

['7891203021106',
 '7891203021106',
 '7891203021304',
 '7891203021304',
 '7891962036984']
  • 2

    The solution works for the case presented, but if the code mixes digits and letters, it will not work. Example: <SCANNER A123B456C789D>
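
To illustrate that comment with its own example string: [0-9]+ splits the code into separate runs of digits. Whether the pieces should then be joined back together depends on what the code is supposed to mean, which is an assumption here:

```python
import re

termo = "<SCANNER A123B456C789D>"

# Each run of consecutive digits becomes a separate match
partes = re.findall(r"[0-9]+", termo)
print(partes)
# ['123', '456', '789']

# Joining the pieces yields all the digits as a single string,
# assuming the letters in between are meant to be discarded
junto = "".join(partes)
print(junto)
# 123456789
```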
