If the idea is to take only the numbers that are between # and %, then do:
import re
texto = "# 44,739 % of all cache refs 12,345 lorem ipsum # 98,736 % etc 45,678 blablbla"
r = re.compile(r'#\s*(\d+,\d+)\s*%')
for n in r.findall(texto):
    print(n)
In this case, the regex matches only 44,739 and 98,736. The other numbers are ignored because they are not between # and %. Note that there is a \s* before and after the number, meaning zero or more whitespace characters.
But if the idea is to take all the numbers, regardless of what comes before or after them, then just use re.compile(r'\d+,\d+').
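To make the difference between the two patterns concrete, here is a minimal sketch that runs both against the same text. The variable names between_delimiters and all_numbers are only illustrative.

```python
import re

texto = "# 44,739 % of all cache refs 12,345 lorem ipsum # 98,736 % etc 45,678 blablbla"

# Only the numbers delimited by # and %
between_delimiters = re.findall(r'#\s*(\d+,\d+)\s*%', texto)
# Every "digits, comma, digits" sequence, wherever it appears
all_numbers = re.findall(r'\d+,\d+', texto)

print(between_delimiters)  # ['44,739', '98,736']
print(all_numbers)         # ['44,739', '12,345', '98,736', '45,678']
```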
It is also unclear whether the number always has the form "digits, comma, digits". If the comma and the digits after it are optional, you can switch to re.compile(r'#\s*(\d+(?:,\d+)?)\s*%').
An important detail is that the part of the pattern that matches the number is in parentheses, forming a capture group. I did this because, when a regex has capture groups, findall returns only the contents of the groups.
And that is why, in the option with the optional comma, I use (?:, which forms a non-capturing group (so its contents are not returned separately by findall).
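The effect of the groups on findall is easier to see side by side. The sketch below uses a made-up sample string, and the variable names are only illustrative.

```python
import re

sample = "# 44,739 % and # 50 %"

# Inner group is non-capturing: findall returns one string per match
non_capturing = re.findall(r'#\s*(\d+(?:,\d+)?)\s*%', sample)
print(non_capturing)  # ['44,739', '50']

# Inner group is capturing: findall returns one tuple per match,
# with an empty string for a group that did not take part in the match
capturing = re.findall(r'#\s*(\d+(,\d+)?)\s*%', sample)
print(capturing)  # [('44,739', ',739'), ('50', '')]

# No groups at all: findall returns the whole match, # and % included
no_groups = re.findall(r'#\s*\d+,\d+\s*%', sample)
print(no_groups)  # ['# 44,739 %']
```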
One of the answers used [\d\,\d]+, but this can give false positives, because this regex also matches a comma on its own. For example:
print(re.findall(r'[\d\,\d]+', 'a, b, 10, d')) # [',', ',', '10,']
The result has the two commas after "a" and "b", plus 10,. With the options above, this problem does not occur. It happens because the brackets define a character class, so [\d,\d] means "a \d (digit), or a comma, or a \d" (only one of them, and yes, the second \d is redundant), so this regex matches a lone comma (and would also match several in a row, such as ,,,,,).
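As a quick check of that explanation, the sketch below compares the character class with the shorter but equivalent [\d,]+ and with \d+,\d+ on a made-up sample string. The variable names are only illustrative.

```python
import re

sample = 'a, b, 10, d ,,,,, 44,739'

with_class = re.findall(r'[\d\,\d]+', sample)
shorter_class = re.findall(r'[\d,]+', sample)  # same class, without the redundancy
strict = re.findall(r'\d+,\d+', sample)

print(with_class)                   # [',', ',', '10,', ',,,,,', '44,739']
print(with_class == shorter_class)  # True
print(strict)                       # ['44,739']
```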
Regex-free
But if you "know" that there is only one occurrence of the number (or only want the first one), you can do without regex:
try:
    texto = "# 44,739 % of all cache refs"
    start = texto.index('#') + 1   # first position after the #
    end = texto.index('%', start)  # first % after the #
    possivel_numero = texto[start:end].strip()
    numero = float(possivel_numero.replace(',', '.'))
    print(numero)
except ValueError:
    print('no number found, or the # or % character is missing')
I use index to obtain the positions of the # and the %, and a slice ([start:end]) to get the text between those positions. Then I try to convert it to a number (I had to replace the comma with a period, since float recognizes only the period as the decimal separator; if the number uses American notation, in which the comma separates thousands, use replace(',', '') instead).
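The two readings of the comma give very different values, so it is worth being explicit about which one applies. A minimal sketch, with two hypothetical helper functions that are not part of the original code:

```python
def parse_decimal_comma(text):
    # Hypothetical helper: the comma is the decimal separator ("44,739" is 44.739)
    return float(text.replace(',', '.'))

def parse_thousands_comma(text):
    # Hypothetical helper: the comma separates thousands ("44,739" is 44739)
    return float(text.replace(',', ''))

print(parse_decimal_comma('44,739'))    # 44.739
print(parse_thousands_comma('44,739'))  # 44739.0
```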
If the conversion fails, or if one of the specified characters (# or %) is missing, a ValueError is raised.
Or, if you want to find all the occurrences:
texto = "# 44,739 % of all cache refs 12,345 lorem ipsum # 98,736 % etc 45,678 blablbla # abc %"
start = end = 0
while True:
    start = texto.find('#', end)  # next # after the last % found
    if start == -1:
        print('no more # characters')
        break
    end = texto.find('%', start + 1)  # first % after that #
    if end == -1:
        print('no more % characters')
        break
    possivel_numero = texto[start + 1:end].strip()
    try:
        numero = float(possivel_numero.replace(',', '.'))
        print(numero)
    except ValueError:
        print('not a number')
Another option is to use partition:
texto = "# 44,739 % of all cache refs 12,345 lorem ipsum # 98,736 % etc 45,678 blablbla # abc %"
tmp = texto
while True:
    _, _, tmp = tmp.partition('#')
    if not tmp:  # no more #
        break
    possivel_numero, sep, tmp = tmp.partition('%')
    if not sep:  # no more %
        break
    try:
        numero = float(possivel_numero.replace(',', '.').strip())
        print(numero)
    except ValueError:
        print('not a number')
Basically, partition splits the string at the first occurrence of the separator and returns a tuple with the part before it, the separator itself, and the part after it. If the separator is not found, it returns the original string followed by two empty strings.
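A few short calls show the three cases the loop relies on. The sample strings and variable names below are only illustrative.

```python
found = 'a#b#c'.partition('#')
missing = 'abc'.partition('#')
at_end = 'abc#'.partition('#')

print(found)    # ('a', '#', 'b#c')  -> splits only at the first separator
print(missing)  # ('abc', '', '')    -> separator not found
print(at_end)   # ('abc', '#', '')   -> nothing left after the separator
```

This is why the loop checks `not tmp` after splitting on # (nothing left to scan) and `not sep` after splitting on % (no closing % was found).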
I edited the question and noticed that, between the # and the number, there are three space characters (instead of the single one shown by the rendered HTML). Are there really three spaces?
– Luiz Felipe
@Miguel, I rolled back the edit because the three spaces were in the original text.
– Luiz Felipe
@Luizfelipe My apologies, I could have sworn I had copied it in full and that it was a single space. I will fix it.
– Miguel
@Miguel, yes, the rendered HTML was showing just one space (that is the expected HTML behavior). However, I do not know whether there were really meant to be three spaces. That is why I left the first comment. Let's see what the OP tells us.
– Luiz Felipe
That's true @Luizfelipe, HTML is tricky. Thanks for the heads-up.
– Miguel