Combine multiple Regular Expressions into one

Question

Asked 6 years ago

Is there any way to combine multiple regex patterns into a single expression, to be used in re.match() or re.search(), for example?

starts_with_Y = '([Y][A-Za-z0-9]{6}([-][A-Za-z0-9]{1})?\s)'
starts_with_t = '([t][a-zA-Z][][A-Za-z0-9]{3}[][A-Za-z0-9]([A-Za-z0-9])?\s)'
starts_with_Q = '([Q]\d{4}\s)'
starts_with_RDN = '(\b(\w*RDN\d{1,2}[-]\d\w*)\b)'
starts_with_snR = '(\b(\w*snR\d{1,3}([-][A-Za-z0-9])?\w*)\b[ ])'
starts_with_NME = '(NME\d{1}[ ])'
starts_with_ICR = '(ICR\d{1}[ ])'
starts_with_LSR = '(LSR\d{1}[ ])'

I need a single regex because the application reads a text file line by line, and only one of these patterns appears on each line. Example:

D      YHR077C NMD2; Nmd2p  K14327 UPF2; regulator of nonsense transcripts 2
D      YGR072W UPF3; Upf3p  K14328 UPF3; regulator of nonsense transcripts 3
D      snR19 SNR19  K14276 U1snRNA; U1 spliceosomal RNA
D      LSR1 LSR1    K14277 U2snRNA; U2 spliceosomal RNA
D      snR14 SNR14  K14278 U4snRNA; U4 spliceosomal RNA
D      snR7-S SNR7-S    K14279 U5snRNA; U5 spliceosomal RNA
D      snR7-L SNR7-L    K14279 U5snRNA; U5 spliceosomal RNA
D      snR6 SNR6    K14280 U6snRNA; U6 spliceosomal RNA
D      snR17a SNR17A    K14483 U3snoRNA; U3 small nucleolar RNA
D      snR17b SNR17B    K14483 U3snoRNA; U3 small nucleolar RNA

And is it possible to use re.compile()?

1 answer

by hkotsubo • **55,826** points · Answer 1 · 2019-07-17T22:43:55+00:00

If each line can match only one of the expressions, one option is to use alternation, through the | character.

Basically, just write expression1|expression2|expression3.... The regex then tries each alternative in turn until one of them matches. To build this regex, I will use join to combine all the expressions at once.
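As a minimal sketch of the idea, the two short patterns below (made up for illustration, not the ones from the question) are joined with |. Wrapping each one in a non-capturing group (?:...) is an extra precaution assumed by this sketch, so that each alternative stays self-contained.

```python
import re

# Hypothetical patterns, used only to show the mechanics of joining.
patterns = ['Q[0-9]{4}', 'LSR[0-9]']

# Wrap each pattern in (?:...) and join them with the alternation operator.
combined = re.compile('|'.join(f'(?:{p})' for p in patterns))

print(combined.pattern)                              # (?:Q[0-9]{4})|(?:LSR[0-9])
print(combined.search('D      LSR1 LSR1').group())   # LSR1
print(combined.search('xx Q0017 yy').group())        # Q0017
```

In the question's case every expression is already wrapped in parentheses, so a plain '|'.join(...) is enough.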

Another detail is that the \ must be escaped (written as \\) because it is inside a string. As written, \b is interpreted as the BACKSPACE character. For it to be interpreted as the regex word boundary (which I believe is the intention), the \ would need to be escaped.
But in Python it is generally better to write regexes as raw string literals, putting an r in front of the opening quote, so that the \ does not need to be escaped:

import re

starts_with_Y = r'([Y][A-Za-z0-9]{6}([-][A-Za-z0-9]{1})?\s)'
starts_with_t = r'([t][a-zA-Z][][A-Za-z0-9]{3}[][A-Za-z0-9]([A-Za-z0-9])?\s)'
starts_with_Q = r'([Q]\d{4}\s)'
starts_with_RDN = r'(\b(\w*RDN\d{1,2}[-]\d\w*)\b)'
starts_with_snR = r'(\b(\w*snR\d{1,3}([-][A-Za-z0-9])?\w*)\b[ ])'
starts_with_NME = r'(NME\d{1}[ ])'
starts_with_ICR = r'(ICR\d{1}[ ])'
starts_with_LSR = r'(LSR\d{1}[ ])'

r = re.compile('|'.join([starts_with_Y, starts_with_t, starts_with_Q, starts_with_RDN, starts_with_snR, starts_with_NME, starts_with_ICR, starts_with_LSR]))

With that, the regex becomes starts_with_Y, or starts_with_t, or starts_with_Q, etc. A possible use would be:

for linha in arquivo:  # arquivo: an open file (or any iterable of lines)
    m = r.search(linha)
    if m:  # a match was found in the line
        print(m.group())  # get the text that was matched
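To make the point about \b concrete, this small check compares a non-raw and a raw version of the same pattern. The pattern is a simplified one chosen for illustration, and the line is one of the samples from the question.

```python
import re

line = 'D      LSR1 LSR1    K14277 U2snRNA; U2 spliceosomal RNA'

# Non-raw string: Python converts \b to a backspace (\x08) before re sees it,
# so the regex looks for literal backspace characters around "LSR1".
non_raw = re.search('\bLSR1\b', line)

# Raw string: \b reaches the regex engine intact and means "word boundary".
raw = re.search(r'\bLSR1\b', line)

print('\b' == '\x08')  # True
print(non_raw)         # None
print(raw.group())     # LSR1
```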

You can also simplify the expressions.

When a character class contains a single character, the brackets are unnecessary, so [Y] is the same as Y. Even [ ] for a space can be replaced by a plain space (although in this specific case the brackets make it more obvious that a space is there). As for [], it is not a meaningful class on its own: depending on the language/engine it matches nothing or is an invalid expression, and in Python a ] placed right after the opening [ is taken as a literal, so [][A-Za-z0-9] is read as a single class that also accepts [ and ]. That is almost certainly not what was intended, so the [] can be removed.

And {1} means "exactly one occurrence", but by default anything in a regex without a quantifier already matches exactly one occurrence of that thing (x{1} is the same as x), so it can be removed too.
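These equivalences are easy to check with re.fullmatch. The helper same_matches and the test strings below are made up for this sketch. The last two lines show one Python-specific caveat: the re module reads a ] placed right after the opening [ as a literal character, so there the [] is not simply ignored.

```python
import re

def same_matches(pattern_a, pattern_b, samples):
    """Return True if both patterns accept exactly the same sample strings."""
    return all(
        bool(re.fullmatch(pattern_a, s)) == bool(re.fullmatch(pattern_b, s))
        for s in samples
    )

print(same_matches(r'[Y]', r'Y', ['Y', 'y', 'YY', '']))      # True
print(same_matches(r'[ ]', r' ', [' ', '  ', 'x', '']))      # True
print(same_matches(r'\d{1}', r'\d', ['7', '42', 'x', '']))   # True

# In Python, [][A-Za-z0-9] is one class that also contains ']' and '['.
print(bool(re.fullmatch(r'[][A-Za-z0-9]', ']')))             # True
print(bool(re.fullmatch(r'[A-Za-z0-9]', ']')))               # False
```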

[A-Za-z0-9]{3}[A-Za-z0-9]([A-Za-z0-9])? means that [A-Za-z0-9] occurs 3 times, then again, and optionally again. That is, this can occur 4 or 5 times, so just do [A-Za-z0-9]{4,5} - the syntax {x,y} means "at least x times, and at most y times".
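A quick sanity check of that rewrite, assuming the stray [] has already been removed: both forms should accept exactly the strings made of 4 or 5 alphanumeric characters. The sample strings are arbitrary.

```python
import re

long_form = r'[A-Za-z0-9]{3}[A-Za-z0-9]([A-Za-z0-9])?'
short_form = r'[A-Za-z0-9]{4,5}'

# For each sample, record whether the long and the short form accept it.
results = {}
for s in ['abc', 'abcd', 'abcde', 'abcdef', 'ab-de']:
    results[s] = (
        bool(re.fullmatch(long_form, s)),
        bool(re.fullmatch(short_form, s)),
    )
    print(s, results[s])

# Both forms agree on every sample.
print(all(a == b for a, b in results.values()))  # True
```

One small difference: the long form has a capturing group around the optional fifth character, which the {4,5} version does not keep.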

Finally, the last 3 expressions are very similar (3 specific letters followed by a digit and a space), so you could merge them into ((NME|ICR|LSR)\d ) (starts with "NME" or "ICR" or "LSR", followed by a digit and a space).
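As a sketch of that merge, the snippet below compares the merged pattern with the three separate ones on a few made-up strings. The (?:...) variant is an optional alternative that is not part of the regex above: it groups the three prefixes without creating an extra capture group. The helper first_match is hypothetical.

```python
import re

separate = re.compile(r'(NME\d )|(ICR\d )|(LSR\d )')
merged = re.compile(r'((NME|ICR|LSR)\d )')
merged_nc = re.compile(r'((?:NME|ICR|LSR)\d )')  # non-capturing inner group

def first_match(regex, text):
    """Return the text of the first match, or None if there is none."""
    m = regex.search(text)
    return m.group() if m else None

samples = ['LSR1 LSR1', 'NME1 x', 'ICR1 x', 'XYZ1 x', 'LSR12 x']
for s in samples:
    print(repr(s), [first_match(p, s) for p in (separate, merged, merged_nc)])

# Number of capture groups in each version.
print(separate.groups, merged.groups, merged_nc.groups)  # 3 2 1
```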

In short, it could look like this:

starts_with_Y = r'(Y[A-Za-z0-9]{6}(-[A-Za-z0-9])?\s)'
starts_with_t = r'(t[a-zA-Z][A-Za-z0-9]{4,5}\s)'
starts_with_Q = r'(Q\d{4}\s)'
starts_with_RDN = r'(\b(\w*RDN\d{1,2}-\d\w*)\b)'
starts_with_snR = r'(\b(\w*snR\d{1,3}(-[A-Za-z0-9])?\w*)\b )'
starts_with_NME_ICR_LSR = r'((NME|ICR|LSR)\d )'

r = re.compile('|'.join([starts_with_Y, starts_with_t, starts_with_Q, starts_with_RDN, starts_with_snR, starts_with_NME_ICR_LSR]))
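As a usage sketch, the simplified regex can be run over a few of the sample lines from the question. Here the lines sit in a list instead of a file, the names patterns, regex and found are chosen for this example, and .strip() is an addition of this sketch to drop the trailing space that the patterns include in the match.

```python
import re

patterns = [
    r'(Y[A-Za-z0-9]{6}(-[A-Za-z0-9])?\s)',
    r'(t[a-zA-Z][A-Za-z0-9]{4,5}\s)',
    r'(Q\d{4}\s)',
    r'(\b(\w*RDN\d{1,2}-\d\w*)\b)',
    r'(\b(\w*snR\d{1,3}(-[A-Za-z0-9])?\w*)\b )',
    r'((NME|ICR|LSR)\d )',
]
regex = re.compile('|'.join(patterns))

lines = [
    'D      YHR077C NMD2; Nmd2p  K14327 UPF2; regulator of nonsense transcripts 2',
    'D      snR19 SNR19  K14276 U1snRNA; U1 spliceosomal RNA',
    'D      LSR1 LSR1    K14277 U2snRNA; U2 spliceosomal RNA',
    'D      snR7-S SNR7-S    K14279 U5snRNA; U5 spliceosomal RNA',
    'D      snR17a SNR17A    K14483 U3snoRNA; U3 small nucleolar RNA',
]

found = []
for line in lines:
    m = regex.search(line)  # first (leftmost) match in the line, if any
    if m:
        found.append(m.group().strip())

print(found)  # ['YHR077C', 'snR19', 'LSR1', 'snR7-S', 'snR17a']
```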