How to capture a number in a string using regular expressions (or a similar method) in Python?

Viewed 277 times

3

I would like to know how I can select a specific part of my text considering that this part refers to a value that can change. For example, the string:

#   44,739 % of all cache refs

I would like to extract only the value 44,739. This value can change from line to line, so I cannot hard-code it in my expression; what I really want is to select the text between the characters # and %. Is there a way?

  • I edited the question and noticed that, between the # and the number, there are three space characters (instead of the single space shown by the rendered HTML). Are there really three spaces?

  • @Miguel, I reverted the edit because the three spaces were in the original text.

  • @Luizfelipe I apologize then, I could have sworn I had copied it exactly, with a single space. I will fix it.

  • @Miguel, yes, the original HTML was rendering just one space (that is the expected HTML behavior). However, I do not know whether there were really supposed to be three spaces. That is why I left the first comment. Let's see what the OP tells us.

  • That's true @Luizfelipe, the HTML fooled me. Thanks for the heads-up.

2 answers

4

You can use the following regular expression:

r"(\d+,\d+)"

See on Regex101.

It will match any number contained in the string (with , as the decimal separator). If numbers can appear in other parts of the string, you can restrict the expression to search only for a number between # (at the beginning of the string) and % (after the number). It looks like this:

r"^# (\d+,\d+) %"

See on Regex101.

Note that both regular expressions contain a capture group, so that we can "extract" the number from a match. Here is a working example:

import re

data = '# 44,739 % of all cache refs'

match = re.search(r"^# (\d+,\d+) %", data)

# Get the first capture group and print it:
num_str = match.group(1)
print(num_str)

But note that the string format is evidently simple, so in this case a regular expression may not be necessary. Another answer provides an alternative.

As an addendum, if you need to use the number contained in the string, you must convert the decimal separator from a comma to a period before parsing it. Example:

num = float(num_str.replace(',', '.'))
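Putting the two steps together, a minimal sketch could wrap the search and the conversion in one function. The name extract_percentage is hypothetical (it does not come from the answer). The pattern assumes \s* instead of a single literal space, so that it also matches the three-space version of the line shown in the question, and the function returns None when nothing matches instead of failing on match.group(1).

```python
import re

# Hypothetical helper: the name and the None-on-failure behaviour are
# assumptions made for this sketch, not part of the original answer.
PATTERN = re.compile(r"^#\s*(\d+,\d+)\s*%")

def extract_percentage(line):
    """Return the number between '#' and '%' as a float, or None if absent."""
    match = PATTERN.search(line)
    if match is None:
        return None
    # float() only accepts '.' as the decimal separator
    return float(match.group(1).replace(",", "."))

print(extract_percentage("#   44,739 % of all cache refs"))  # 44.739
print(extract_percentage("no number here"))                  # None
```

Checking for None matters because re.search returns None when the pattern is not found, and calling group on None raises AttributeError.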

4


If the idea is to take only the numbers that are between # and %, then do:

import re
texto = "# 44,739 % of all cache refs 12,345 lorem ipsum # 98,736 % etc 45,678 blablbla"
r = re.compile(r'#\s*(\d+,\d+)\s*%')
for n in r.findall(texto):
    print(n)

In this case, the regex only matches 44,739 and 98,736. The other numbers are ignored because they are not between # and %. Note that there is \s* before and after the number, meaning zero or more whitespace characters.

But if the idea is to take all the numbers, regardless of what comes before or after them, then just use re.compile(r'\d+,\d+').

It is also unclear whether the number always has the form "digits, comma, digits". If the comma and the digits after it are optional, you can switch to re.compile(r'#\s*(\d+(?:,\d+)?)\s*%').

An important detail is that the part that matches the number is in parentheses to form a capture group. I did this because, when a regex has capture groups, findall returns only the contents of the groups.
That is also why, in the option with the optional comma, I use (?:, because it forms a non-capturing group (so its contents are not returned separately by findall).
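To illustrate the difference, here is a small sketch comparing the two kinds of group. The sample string and the variable names with_capture and without_capture are made up for this example.

```python
import re

# Sample text invented for this illustration
texto = "# 44,739 % of all cache refs # 10 % etc"

# Two capturing groups: findall returns one tuple per match,
# with one item per group (an empty string if the group did not match).
with_capture = re.findall(r'#\s*(\d+(,\d+)?)\s*%', texto)
print(with_capture)     # [('44,739', ',739'), ('10', '')]

# Inner group made non-capturing with (?: ... ): only one capturing
# group remains, so findall returns plain strings.
without_capture = re.findall(r'#\s*(\d+(?:,\d+)?)\s*%', texto)
print(without_capture)  # ['44,739', '10']
```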


One of the answers used [\d\,\d]+, but this can give false positives, because this regex also matches a comma on its own. For example:

print(re.findall(r'[\d\,\d]+', 'a, b, 10, d')) # [',', ',', '10,']

The result will contain the two commas after "a" and "b", plus 10,. With the options above, this problem does not occur. For the record, this happens because the brackets define a character class, so [\d\,\d] means "a \d (digit), or a comma, or a \d" (only one of them, and yes, the second \d is redundant). That is why this regex matches a single comma on its own (and would also match several in a row, such as ,,,,,).


Regex-free

But if you "know" that there is only one occurrence of the number (or only want the first one), you can do without regex:

try:
    texto = "# 44,739 % of all cache refs"
    start = texto.index('#') + 1
    end = texto.index('%', start + 1)
    possivel_numero = texto[start:end].strip()
    numero = float(possivel_numero.replace(',', '.'))
    print(numero)
except ValueError:
    print('no number found, or the characters # or % are missing')

I use index to get the positions of # and %, and a slice ([start:end]) to get the text between those positions. Then I try to convert it to a number (I had to replace the comma with a period, since float only recognizes the period as the decimal separator; but if the number uses American notation, in which the comma separates the thousands, then use replace(',', '') instead).

If the conversion to a number fails, or one of the specified characters (# or %) is missing, a ValueError is raised.
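As a rough sketch of the two notations mentioned above, the conversion could be isolated in a small function. The name parse_number and its decimal_comma parameter are hypothetical, introduced only for this illustration.

```python
def parse_number(text, decimal_comma=True):
    """Convert text to float; raises ValueError if it is not a number.

    Hypothetical helper: decimal_comma=True treats ',' as the decimal
    separator, decimal_comma=False treats it as a thousands separator.
    """
    text = text.strip()
    if decimal_comma:
        text = text.replace(',', '.')   # "44,739" -> "44.739"
    else:
        text = text.replace(',', '')    # "44,739" -> "44739"
    return float(text)

print(parse_number(" 44,739 "))                     # 44.739
print(parse_number("44,739", decimal_comma=False))  # 44739.0
```

Which notation applies has to be known in advance, since "44,739" alone does not say whether the comma is a decimal or a thousands separator.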


Or, if you want to find all the occurrences:

texto = "# 44,739 % of all cache refs 12,345 lorem ipsum # 98,736 % etc 45,678 blablbla # abc %"
start = end = 0
while True:
    start = texto.find('#', end)
    if start == -1:
        print('No more # found')
        break
    end = texto.find('%', start + 1)
    if end == -1:
        print('No more % found')
        break
    possivel_numero = texto[start + 1:end].strip()
    try:
        numero = float(possivel_numero.replace(',', '.'))
        print(numero)
    except ValueError:
        print('not a number')

Another option is to use partition:

texto = "#    44,739   % of all cache refs 12,345 lorem ipsum # 98,736 % etc 45,678 blablbla # abc %"
tmp = texto
while True:
    _, _, tmp = tmp.partition('#')
    if not tmp: # no more #
        break
    possivel_numero, sep, tmp = tmp.partition('%')
    if not sep: # no more %
        break
    try:
        numero = float(possivel_numero.replace(',', '.').strip())
        print(numero)
    except ValueError:
        print('not a number')

Basically, partition returns a tuple with the part before the separator, the separator itself, and the part after it. If the separator is not found, it returns the original string followed by two empty strings.
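A short illustration of what str.partition returns in each case (the sample strings and the names found and missing are only for demonstration):

```python
texto = "# 44,739 % of all cache refs"

# Separator found: (text before, separator, text after)
found = texto.partition('%')
print(found)    # ('# 44,739 ', '%', ' of all cache refs')

# Separator not found: (original string, '', '')
missing = "no separator here".partition('%')
print(missing)  # ('no separator here', '', '')
```

This is why the loop above can test the returned separator (sep) to detect that there is no more % in the remaining text.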
