You can use the following:
import re
texto = """
1.1. FLESH
1.1.2. BRAIN
"""
r = re.compile(r'^(?:[1-9]\.)+\s[A-Z]+', re.MULTILINE)
results = r.findall(texto)
print(results) # ['1.1. FLESH', '1.1.2. BRAIN']
I used the anchor ^, which usually means "beginning of the string"; with the MULTILINE flag, it also means "beginning of a line". This ensures that the regex only picks up the numbers when they are at the beginning of a line (as seems to be the case).
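To see what the flag changes, here is a minimal sketch that runs the same pattern with and without MULTILINE. The variable names without_flag and with_flag are just illustrative.

```python
import re

texto = "1.1. FLESH\n1.1.2. BRAIN"

# Without MULTILINE, ^ matches only at the very start of the string.
without_flag = re.findall(r'^(?:[1-9]\.)+\s[A-Z]+', texto)
# With MULTILINE, ^ also matches right after each line break.
with_flag = re.findall(r'^(?:[1-9]\.)+\s[A-Z]+', texto, re.MULTILINE)

print(without_flag)  # ['1.1. FLESH']
print(with_flag)     # ['1.1. FLESH', '1.1.2. BRAIN']
```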
Then there's the sequence [1-9]\. (a digit from 1 to 9, followed by a dot). This is wrapped in parentheses, and the quantifier + after the parentheses indicates that the whole group (digit followed by a dot) can repeat one or more times. Thus, the regex matches 1., 1.1., 1.1.1.1.1., and so on.
If you want to limit the number of times the "digit followed by a dot" is repeated, you can replace the + with {min,max}. For example, (?:[1-9]\.){2,5} means that it must repeat at least 2 and at most 5 times (it is also possible to use {2,}: at least twice, with no upper limit).
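A quick sketch of the {2,5} variant, assuming the rest of the regex stays the same. The sample strings '1. HEAD' and '1.1.1.1.1.1. X' are made up for illustration.

```python
import re

# Same regex as before, but the group must repeat between 2 and 5 times.
r = re.compile(r'^(?:[1-9]\.){2,5}\s[A-Z]+')

one_level = bool(r.match('1. HEAD'))          # only one "digit + dot"
two_levels = bool(r.match('1.1. FLESH'))      # two repetitions
six_levels = bool(r.match('1.1.1.1.1.1. X'))  # six repetitions

print(one_level, two_levels, six_levels)  # False True False
```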
Then there's the space (\s, which matches any whitespace character) and [A-Z]+ (one or more capital letters). In your regex you used [A-Z]* (zero or more letters), so your regex matches even if there is nothing after the space. If you want to require at least one letter, use + (or {min,max}, if you want to be more specific about the number of letters allowed).
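The difference between * and + can be checked with a small sketch. The two patterns below differ only in the final quantifier; they are an assumption for illustration, not necessarily your exact regex.

```python
import re

star = re.compile(r'^(?:[1-9]\.)+\s[A-Z]*')  # zero or more letters
plus = re.compile(r'^(?:[1-9]\.)+\s[A-Z]+')  # one or more letters

sample = '1.1. '  # numbering followed by a space and nothing else

star_matches = bool(star.match(sample))
plus_matches = bool(plus.match(sample))

print(star_matches)  # True
print(plus_matches)  # False
```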
Finally, I used the method findall, which returns a list of all the matches found. The result is the list:
['1.1. FLESH', '1.1.2. BRAIN']
Note that the parentheses start with (?:. That makes them a non-capturing group. If I had used only (, they would form a capturing group, and findall returns the groups instead of the whole match when groups are present. As I want the whole match returned, I needed the non-capturing group.
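Here is a small sketch of that difference. With a capturing group, findall returns only what the group captured, and for a repeated group that is its last repetition.

```python
import re

texto = "1.1. FLESH\n1.1.2. BRAIN"

# Capturing group: findall returns the group (its last repetition).
capturing = re.findall(r'^([1-9]\.)+\s[A-Z]+', texto, re.MULTILINE)
# Non-capturing group: findall returns the whole match.
non_capturing = re.findall(r'^(?:[1-9]\.)+\s[A-Z]+', texto, re.MULTILINE)

print(capturing)      # ['1.', '2.']
print(non_capturing)  # ['1.1. FLESH', '1.1.2. BRAIN']
```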
If you're searching one string at a time, you don't need the MULTILINE flag:
r = re.compile(r'^(?:[1-9]\.)+\s[A-Z]+')
print(r.findall('1.1. FLESH')) # ['1.1. FLESH']
print(r.findall('1.1.2. BRAIN')) # ['1.1.2. BRAIN']
In this case, findall is useful for finding all occurrences at once (if there is more than one in the string). But if you only want one occurrence, you can use the method match, which only looks at the beginning of the string:
r = re.compile(r'^(?:[1-9]\.)+\s[A-Z]+')
match = r.match('1.1. FLESH')
if match:  # a match was found
    print(match.group())  # 1.1. FLESH
If the numbers do not necessarily occur at the beginning of the line, simply remove the ^ from the regex (the MULTILINE flag is then no longer needed either):
r = re.compile(r'(?:[1-9]\.)+\s[A-Z]+')
So the text could be 1.1. FLESH 1.1.2. BRAIN (all on the same line), and findall will find both occurrences (as it is not clear whether the numbers only occur at the beginning of the line, choose the option that best fits your case).
The method findall returns a list of the matches found. But if you want to iterate through them, you can use the method finditer, which returns an iterator of matches. The difference is that findall returns a list of strings, while finditer yields match objects, which contain various information about the excerpt found. For example:
import re
texto = "1.1. FLESH 1.1.2. BRAIN"
r = re.compile(r'(?:[1-9]\.)+\s[A-Z]+')
for m in r.finditer(texto):
    print('String "{}" found between positions {} and {}'.format(m.group(), m.start(), m.end()))
Output:
String "1.1. FLESH" found between positions 0 and 10
String "1.1.2. BRAIN" found between positions 11 and 23
In your regex you used [1-9][.]+ (a digit from 1 to 9, followed by one or more dots), so it doesn't pick up all the digits: it allows only a single digit before the dots.
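To illustrate, here is a sketch comparing the two forms. I'm assuming the rest of your regex was the space plus the capital letters, so the first pattern below is only an approximation of yours.

```python
import re

texto = "1.1.2. BRAIN"

# One digit followed by one or more dots (approximation of the original regex).
single = re.findall(r'[1-9][.]+\s[A-Z]+', texto)
# The "digit + dot" pair repeated one or more times.
grouped = re.findall(r'(?:[1-9]\.)+\s[A-Z]+', texto)

print(single)   # ['2. BRAIN']
print(grouped)  # ['1.1.2. BRAIN']
```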
In addition, the shortcut \b (word boundary) is redundant: since the regex already requires a space before the letters, that position is already a word boundary (a position in the string that has an alphanumeric character before it and a non-alphanumeric character after it, or vice versa). Hence the \b is not necessary in this case.
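A quick check of that redundancy, assuming the \b sat between the space and the letters (the exact position in your regex is my assumption):

```python
import re

texto = "1.1. FLESH 1.1.2. BRAIN"

# \b between whitespace and a letter always succeeds, so it changes nothing.
with_boundary = re.findall(r'(?:[1-9]\.)+\s\b[A-Z]+', texto)
without_boundary = re.findall(r'(?:[1-9]\.)+\s[A-Z]+', texto)

print(with_boundary == without_boundary)  # True
```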
OK, @hkotsubo. I will test and give you feedback.
– Antonio Braz Finizola