How to write a regex to capture a sequence of numbers with dots followed by a string

Examples are:

1.1. FLESH
1.1.2. BRAIN

I want to match sequences of the form number.number. (and so on), followed by a space and an uppercase string (it can’t be lowercase).

I tried this:

reg = r'[1-9][.]+\s\b[A-Z]*'

However, the result includes only the last number and dot of the sequence:

Result:

2. BRAIN

2 answers

3


You can use the following:

import re

texto = """
1.1. FLESH
1.1.2. BRAIN
"""

r = re.compile(r'^(?:[1-9]\.)+\s[A-Z]+', re.MULTILINE)
results = r.findall(texto)
print(results) # ['1.1. FLESH', '1.1.2. BRAIN']

I used the anchor ^, which usually means "beginning of the string", but with the MULTILINE flag it also means "beginning of a line". This ensures that the numbers are matched only if they are at the beginning of a line (as seems to be the case).
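As a minimal sketch of the effect of the flag, the same pattern can be run with and without MULTILINE on a two-line string (the variable names here are only illustrative):

```python
import re

text = "1.1. FLESH\n1.1.2. BRAIN"
pattern = r'^(?:[1-9]\.)+\s[A-Z]+'

# Without MULTILINE, ^ matches only at the very start of the string.
without_flag = re.findall(pattern, text)

# With MULTILINE, ^ also matches right after each line break.
with_flag = re.findall(pattern, text, re.MULTILINE)

print(without_flag)  # ['1.1. FLESH']
print(with_flag)     # ['1.1. FLESH', '1.1.2. BRAIN']
```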

Then there’s the sequence [1-9]\. (a digit from 1 to 9, followed by a dot). All of this is in parentheses, and the quantifier + after the parentheses indicates that the whole group (digit followed by a dot) can repeat one or more times. Thus, the regex matches 1., 1.1., 1.1.1.1.1., and so on.

If you want to limit how many times the "digit followed by a dot" group is repeated, you can replace the + with {min,max}. For example, (?:[1-9]\.){2,5} means that it must be repeated at least 2 and at most 5 times (it is also possible to use {2,}: at least twice, with no upper limit).
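To illustrate the bounded quantifier, here is a small sketch that accepts only 2 or 3 "digit plus dot" groups. The limit {2,3} and the extra sample lines (SKIN, NERVE) are assumptions made up for the example, not part of the question:

```python
import re

# Accept only 2 or 3 repetitions of "digit followed by a dot".
bounded = re.compile(r'^(?:[1-9]\.){2,3}\s[A-Z]+')

samples = ['1. SKIN', '1.1. FLESH', '1.1.2. BRAIN', '1.1.2.3. NERVE']
accepted = [s for s in samples if bounded.match(s)]

print(accepted)  # ['1.1. FLESH', '1.1.2. BRAIN']
```

The first sample has too few groups and the last one has too many, so neither matches.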

Then there’s the space and [A-Z]+ (one or more uppercase letters). In your regex you used [A-Z]* (zero or more letters), which allows a match even when no uppercase letter follows the space. If you want to require at least one letter, use + (or use {min,max} if you want to be more specific about the number of letters allowed).
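The difference between the two quantifiers can be seen with a lowercase word, which should not be accepted. This is just a sketch using the pattern from this answer, with only the last quantifier changed:

```python
import re

star = re.compile(r'^(?:[1-9]\.)+\s[A-Z]*')  # zero or more uppercase letters
plus = re.compile(r'^(?:[1-9]\.)+\s[A-Z]+')  # at least one uppercase letter

line = '1.1. flesh'
star_match = star.match(line)
plus_match = plus.match(line)

print(repr(star_match.group()))  # '1.1. '
print(plus_match)                # None
```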

Finally, I used the method findall, which returns a list of all regex occurrences found. The result is the list:

['1.1. FLESH', '1.1.2. BRAIN']

Note that the parentheses start with (?:, which makes them a non-capturing group. If I had used only (, they would form a capturing group, and the findall method returns the groups when any are present. Since I want the whole match to be returned, I needed the non-capturing group.
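A short sketch of what findall returns in each case. With a capturing group that repeats, only the last repetition of the group is kept:

```python
import re

text = "1.1. FLESH\n1.1.2. BRAIN"

# Capturing group: findall returns the group, not the whole match.
capturing = re.findall(r'^([1-9]\.)+\s[A-Z]+', text, re.MULTILINE)

# Non-capturing group: findall returns the whole match.
non_capturing = re.findall(r'^(?:[1-9]\.)+\s[A-Z]+', text, re.MULTILINE)

print(capturing)      # ['1.', '2.']
print(non_capturing)  # ['1.1. FLESH', '1.1.2. BRAIN']
```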


If you’re searching one string at a time, you don’t need the MULTILINE flag:

r = re.compile(r'^(?:[1-9]\.)+\s[A-Z]+')
print(r.findall('1.1. FLESH')) # ['1.1. FLESH']
print(r.findall('1.1.2. BRAIN')) # ['1.1.2. BRAIN']

Here, findall is useful for finding all occurrences at once (if there is more than one in the string). But if you only want one occurrence, you can use the match method:

r = re.compile(r'^(?:[1-9]\.)+\s[A-Z]+')
match = r.match('1.1. FLESH')
if match: # if a match was found
    print(match.group()) # 1.1. FLESH

If the numbers do not necessarily occur at the beginning of the line, simply remove the ^ from the regex (the MULTILINE flag is then unnecessary as well):

r = re.compile(r'(?:[1-9]\.)+\s[A-Z]+')

This way the text could be 1.1. FLESH 1.1.2. BRAIN (all on the same line), and findall will find both occurrences (since it is not clear whether the numbers occur only at the beginning of the line, choose the option that best fits your case).
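For instance, a quick check of the unanchored pattern on a single-line string:

```python
import re

r = re.compile(r'(?:[1-9]\.)+\s[A-Z]+')

# Both items are on the same line; no ^ and no MULTILINE needed.
found = r.findall('1.1. FLESH 1.1.2. BRAIN')

print(found)  # ['1.1. FLESH', '1.1.2. BRAIN']
```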

The findall method returns a list of the occurrences found. If you want to iterate over them, you can use the finditer method, which returns an iterator of matches. The difference is that findall returns a list of strings, while finditer yields match objects, which contain various information about the matched text. For example:

import re

texto = "1.1. FLESH  1.1.2. BRAIN"

r = re.compile(r'(?:[1-9]\.)+\s[A-Z]+')

for m in r.finditer(texto):
    print('String "{}" found between positions {} and {}'.format(m.group(), m.start(), m.end()))

Output:

String "1.1. FLESH" found between positions 0 and 10
String "1.1.2. BRAIN" found between positions 12 and 24

In your regex you used [1-9][.]+ (a digit from 1 to 9, followed by one or more dots), so it matches only a single digit and its dots, not the whole sequence of digits.
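Running the regex from the question on both example lines shows the effect. Only the last "digit plus dot" before the space is included in each match:

```python
import re

# The pattern from the question, unchanged.
original = r'[1-9][.]+\s\b[A-Z]*'

found_original = re.findall(original, "1.1. FLESH\n1.1.2. BRAIN")

print(found_original)  # ['1. FLESH', '2. BRAIN']
```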

In addition, the shortcut \b (word boundary) is redundant: the regex already requires a space before the letters, which already implies a word boundary (a position in the string that has an alphanumeric character before and a non-alphanumeric character after, or vice versa). Hence the \b is not necessary in this case.
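A simple way to check this is to compare the results with and without \b. This sketch uses the corrected pattern with [A-Z]+, where a letter is always required right after the space:

```python
import re

text = "1.1. FLESH 1.1.2. BRAIN"

with_boundary = re.findall(r'(?:[1-9]\.)+\s\b[A-Z]+', text)
without_boundary = re.findall(r'(?:[1-9]\.)+\s[A-Z]+', text)

print(with_boundary == without_boundary)  # True
```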

0

Regular Expression

If the whole string is on a single line, with no line breaks (CR or LF), use the following regular expression:

\d+(?:\.\d+)*[\s\S]*?\D(?=\s*(?:\d|$))

And see the demo on Regex101:

(Image: Regex101 result)

Details

  • \d+ - 1 or more digits
  • (?:\.\d+)* - zero or more sequences of:
    • \. - dot
    • \d+ - 1 or more digits
    • * - quantifier: the group repeats zero or more times
  • [\s\S]*? - zero or more of any character, as few as possible, up to the first...
  • \D - any non-digit character
  • (?=\s*(?:\d|$)) - positive lookahead that requires zero or more whitespace characters [ \r\n\t\f\v] (\s*) followed by a digit (\d) or the end of the string ($) immediately to the right.

Python

In a Python program, use the re module.
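A minimal sketch of how the expression above could be used with re.findall, assuming the single-line input from the question:

```python
import re

pattern = re.compile(r'\d+(?:\.\d+)*[\s\S]*?\D(?=\s*(?:\d|$))')

# Single-line input, as assumed by this answer.
items = pattern.findall('1.1. FLESH 1.1.2. BRAIN')

print(items)  # ['1.1. FLESH', '1.1.2. BRAIN']
```

Note that this pattern does not require the text after the numbers to be uppercase; it simply takes everything up to the next digit or the end of the string.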
