How to access the position of multiple matches of a regex in a string?

3

I know I can access the position of a match of a regex in a string using the methods start, end and span. Example of a program that identifies repetitions in a text:

import re
from colorama import Fore,Style

text='''Dorsey was born and raised in St. Louis, Missouri,[8][9] 
the son of Tim and Marcia (née Smith) Dorsey.[10][11][12] He is of English, Irish, and Italian descent.[13] 
His father worked for a company that developed mass spectrometers and his mother was a homemaker.[14] 
He was raised Catholic, and his uncle is a Catholic Catholic priest in Cincinnati.[15] He attended the Catholic 
Bishop DuBourg High School. In his younger days, Dorsey worked occasionally as a fashion model.
[16][17][18][19][20] By age 14, Dorsey had become interested in dispatch dispatch routing. Some of the open-source software he created in the area of dispatch logistics is still used by taxicab companies.[10] Dorsey enrolled at the University of Missouri–Rolla in 1995 and attended for two-plus years[15] before transferring to New York University in 1997, but he dropped out two years later,[21] one semester short of graduating.[15] 
He came up with the idea that he developed as Twitter while studying at NYU.[15][22]
'''

print("Searching for repeated words ...", "\n")
try:
    result=re.search(r'(\w{3,}\s)\1',text)
    start=result.start()
    end=result.end()
    value=result.group()
    print("The word \"{}\" is repeated at: ".format(value.split(' ')[0]),"\n\n")

    print(text[start-100:start]+ Fore.RED + text[start:end]+ Style.RESET_ALL+text[end:end+200])
except:
    print("No repeated words found")

Returns:

[image of the program output]

Note that the problem with this program is that it identifies only one occurrence. I imagined that the start method would return a list or tuple when there is more than one match, but that is not what happens.

How can I access the positions of all matches of a regular expression in a string? For example, the word dispatch is also repeated in the text, but I do not know how to get its position.

  • Not directly related, but when search finds nothing it returns None, so instead of try/except you can just do if result: (found) else: (not found).

  • Do you want only consecutive repeated words (e.g. Catholic Catholic and dispatch dispatch), or all repeated words (e.g. Dorsey)?
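The suggestion in the first comment could look roughly like this. It is only a sketch: first_repeat is a hypothetical helper name, and the regex is the one from the question, unchanged.

```python
import re

def first_repeat(text):
    """Return the (start, end) span of the first doubled word, or None."""
    result = re.search(r'(\w{3,}\s)\1', text)
    if result:
        # search found something, so the match object is safe to use
        return result.span()
    # search returned None: nothing matched, no exception involved
    return None

found = first_repeat("a Catholic Catholic priest")
missing = first_repeat("no repetition here")
print(found, missing)
```

Testing the result explicitly also avoids the bare except, which would hide any unrelated error raised inside the try block.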

2 answers

5


You are using the function re.search to match the regular expression. According to the documentation:

Scan through string looking for the first location where the regular expression pattern produces a match, and return a corresponding match object. [...]

Therefore, since search (like match and fullmatch) returns at most one match, you must use another function, one that performs successive searches over the string.

Python provides the functions re.findall and re.finditer to carry out the "complete" search, going beyond the first occurrence when there are more. In short:

  • re.findall returns all non-overlapping matches as a list of strings (when the pattern has a single capture group, the strings are the text of that group).
  • re.finditer returns an iterator that produces a match object for each match in the string.
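To make the difference concrete, here is a small sketch using a shortened sample string (my own, not the text from the question) with the same regex:

```python
import re

sample = "a Catholic Catholic priest and dispatch dispatch routing"
pattern = r'(\w{3,}\s)\1'

# findall: with one capture group, the list holds only that group's text,
# so the positions are lost
words = re.findall(pattern, sample)

# finditer: every item is a match object, so start/end/span are available
spans = [m.span() for m in re.finditer(pattern, sample)]

print(words)  # ['Catholic ', 'dispatch ']
print(spans)  # [(2, 20), (31, 49)]
```

Note the trailing space in each string returned by findall: it is there because the space is inside the capture group.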

Since your code uses methods of match objects, finditer is the better fit.

Note that, because finditer returns an iterator, you need some way to iterate over the match objects it produces (a for loop, for example). If the string has no match at all, the iterator is empty and no iteration happens.
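Because an empty iterator simply skips the loop, the "No repeated words found" message from the question needs a separate check. One possible way, assuming the text is small enough to collect everything into a list first (repeated_word_spans is a hypothetical helper name):

```python
import re

def repeated_word_spans(text):
    """Collect the (start, end) span of every doubled word in text."""
    return [m.span() for m in re.finditer(r'(\w{3,}\s)\1', text)]

spans = repeated_word_spans("nothing is doubled here")
if not spans:
    # An empty list means finditer produced no match objects
    print("No repeated words found")
```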

In the example of the question, it would be something like:

import re
from colorama import Fore, Style

text = '''Dorsey was born and raised in St. Louis, Missouri,[8][9]
the son of Tim and Marcia (née Smith) Dorsey.[10][11][12] He is of English, Irish, and Italian descent.[13]
His father worked for a company that developed mass spectrometers and his mother was a homemaker.[14]
He was raised Catholic, and his uncle is a Catholic Catholic priest in Cincinnati.[15] He attended the Catholic
Bishop DuBourg High School. In his younger days, Dorsey worked occasionally as a fashion model.
[16][17][18][19][20] By age 14, Dorsey had become interested in dispatch dispatch routing. Some of the open-source software he created in the area of dispatch logistics is still used by taxicab companies.[10] Dorsey enrolled at the University of Missouri–Rolla in 1995 and attended for two-plus years[15] before transferring to New York University in 1997, but he dropped out two years later,[21] one semester short of graduating.[15]
He came up with the idea that he developed as Twitter while studying at NYU.[15][22]
'''

print("Searching for repeated words ...", "\n")

all_matches_iter = re.finditer(r'(\w{3,}\s)\1', text)

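for match in all_matches_iter:
    start = match.start()
    end = match.end()
    # group(1) is the repeated word plus its trailing whitespace
    word = match.group(1).strip()

    print("The word \"{}\" is repeated at: ".format(word), "\n\n")
    # max() keeps the slice start from going negative near the beginning of the text
    print(text[max(0, start-100):start] + Fore.RED +
          text[start:end] + Style.RESET_ALL + text[end:end+200])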

3

Just to offer another alternative (what I would really use is finditer, as the other answer already explained).

You can use search, indicating the position at which the search should begin, and loop until there are no more matches:

import re

# text is the same string used in the question
r = re.compile(r'(\w{3,}\s)\1')
end = 0
print("Searching for repeated words ...", "\n")
while True:
    result = r.search(text, end)
    if not result:
        break  # nothing found, leave the while loop
    start = result.start()
    end = result.end()
    value = result.group()
    # the rest is the same (print, etc.)

Note that each search starts at the position where the previous one ends (except for the first one, which starts from the beginning of the string).

Interestingly, this option of passing the starting position as a parameter is only available on the search method of the Pattern class (which is what compile returns), not on the module-level function re.search.
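As a self-contained sketch of that loop, wrapped in a generator: iter_matches is a hypothetical helper name, the sample string is mine, and the guard for empty matches is my own addition (this particular regex can never match an empty string).

```python
import re

def iter_matches(pattern, text):
    """Yield successive matches, restarting each search where the last one ended."""
    pos = 0
    while pos <= len(text):
        result = pattern.search(text, pos)  # Pattern.search accepts a start position
        if result is None:
            break
        yield result
        # Move past an empty match so the loop always advances
        pos = result.end() if result.end() > result.start() else result.end() + 1

doubled = re.compile(r'(\w{3,}\s)\1')
sample = "a Catholic Catholic priest and dispatch dispatch routing"
spans = [m.span() for m in iter_matches(doubled, sample)]
print(spans)  # [(2, 20), (31, 49)]
```

For this pattern the result is the same as with doubled.finditer(sample). With the module-level re.search, the third positional argument is flags, not a position.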


While we're at it, your regex can be improved, because it has some problems.

For example, if the text has something like "Não seria ria etc...", the stretch "ria ria" will be found (the first "ria" is the end of the word "seria", Portuguese for "would be").

And something like "etc etc." is not found, because you placed the space inside the capture group, so \1 only matches if the second word is also followed by a space.

To fix this, the regex could be r'(\b\w{3,}\b)\s\1'. Here \b marks a word boundary, and the space is now outside the parentheses (so result.group(1) returns the word without the space; in your code the group includes the trailing space, and I do not know if that was the intention).
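A quick check of both problems, comparing the regex from the question with the one suggested above. The third pattern, stricter, is my own assumption and not part of the answer: without a \b after \1, something like "dispatch dispatching" still matches, and a trailing \b is one possible way to rule that out.

```python
import re

original = re.compile(r'(\w{3,}\s)\1')        # regex from the question
improved = re.compile(r'(\b\w{3,}\b)\s\1')    # regex suggested in this answer
stricter = re.compile(r'(\b\w{3,}\b)\s\1\b')  # hypothetical variant with a trailing \b

samples = ("Não seria ria etc...", "etc etc.", "dispatch dispatching")
for sample in samples:
    # One row per sample: does each pattern find a repetition?
    print(repr(sample),
          bool(original.search(sample)),
          bool(improved.search(sample)),
          bool(stricter.search(sample)))
```

The first sample is a false positive only for original, the second is found by improved and stricter but missed by original, and the third is accepted by improved but rejected by stricter.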

It is worth remembering that \w matches not only letters, but also digits and the character _. If you only want letters (including accented ones), you can switch to r'(\b[^\W\d_]{3,}\b)\s\1'.
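The effect of swapping \w for [^\W\d_] can be seen in a small sketch (the sample string is made up for illustration):

```python
import re

with_w = re.compile(r'(\b\w{3,}\b)\s\1')
letters_only = re.compile(r'(\b[^\W\d_]{3,}\b)\s\1')

sample = "codes 1234 1234 and ação ação"

# \w also accepts digits, so the repeated number counts as a repeated "word"
all_repeats = with_w.findall(sample)

# [^\W\d_] is "a word character that is neither a digit nor an underscore"
letter_repeats = letters_only.findall(sample)

print(all_repeats)     # ['1234', 'ação']
print(letter_repeats)  # ['ação']
```

This relies on Python 3 str patterns, where \w is Unicode-aware by default and therefore covers accented letters.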

  • Passing the position where the search should start probably (not confirmed) helps performance a little (or a lot). I promise to come back with a benchmark comparing this with finditer on a much simpler example.

  • @Guilhermenascimento I did a basic test with timeit and there was no significant difference (sometimes finditer is slightly faster, sometimes not; in practice it was a technical tie): https://replit.com/@hkotsubo/Regexsearchvsfinditer#main.py | https://ideone.com/jYfS14 - I believe that in most cases it will not make a difference, although in one case or another, depending on the string and/or the regex, it might.
