If each line can match only one of the expressions, one option is to use alternation, with the | character. Basically, just write expressao1|expressao2|expressao3... and the regex engine will test each alternative, in order, until one of them matches. To build this regex, I will use join to combine all the expressions at once.
Another detail is that the \ must be escaped (written as \\) because it is inside a string. As written, \b is interpreted as the BACKSPACE character. For it to be interpreted as the regex word boundary (which I believe is the intention), the \ would need to be escaped.
But in Python it is generally better to use raw string literals for regex, by placing an r in front of the opening quote, so that the \ does not need to be escaped:
import re
starts_with_Y = r'([Y][A-Za-z0-9]{6}([-][A-Za-z0-9]{1})?\s)'
starts_with_t = r'([t][a-zA-Z][][A-Za-z0-9]{3}[][A-Za-z0-9]([A-Za-z0-9])?\s)'
starts_with_Q = r'([Q]\d{4}\s)'
starts_with_RDN = r'(\b(\w*RDN\d{1,2}[-]\d\w*)\b)'
starts_with_snR = r'(\b(\w*snR\d{1,3}([-][A-Za-z0-9])?\w*)\b[ ])'
starts_with_NME = r'(NME\d{1}[ ])'
starts_with_ICR = r'(ICR\d{1}[ ])'
starts_with_LSR = r'(LSR\d{1}[ ])'
r = re.compile('|'.join([starts_with_Y, starts_with_t, starts_with_Q, starts_with_RDN, starts_with_snR, starts_with_NME, starts_with_ICR, starts_with_LSR]))
With that, the regex becomes starts_with_Y, or starts_with_t, or starts_with_Q, etc. A possible use would be:
for linha in arquivo:
    m = r.search(linha)
    if m:  # a match was found in the line
        print(m.group())  # get the text found by the match
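To check the two points above (the \b escaping and the alternation built with join), here is a minimal sketch. The sample text and the list linhas are hypothetical stand-ins for the real file, and only two of the expressions are used to keep it short.

```python
import re

# In a normal string, '\b' is a single BACKSPACE character (code point 8).
assert '\b' == '\x08' and len('\b') == 1
# In a raw string the backslash is kept, so the regex engine receives \b.
assert r'\b' == '\\b' and len(r'\b') == 2

text = 'code RDN12-3'  # hypothetical sample text
with_backspace = re.search('\bRDN', text)  # looks for BACKSPACE followed by "RDN"
with_boundary = re.search(r'\bRDN', text)  # looks for "RDN" at a word boundary
print(with_backspace)          # None
print(with_boundary.group())   # RDN

# Alternation built with join, applied to hypothetical lines.
linhas = ['Q1234 first', 'NME1 second', 'nothing here']
parts = [r'(Q\d{4}\s)', r'(NME\d )']
r = re.compile('|'.join(parts))
found = [m.group() for m in map(r.search, linhas) if m]
print(found)  # ['Q1234 ', 'NME1 ']
```

Note that the matched text includes the trailing space, because the space is part of each expression.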
You can also simplify the expressions.
When you only want to match a single character, you don't need the brackets, so [Y] is the same as Y. Even [ ] for a space can be replaced by a plain space (although in this specific case it may be less obvious that there is a space). As for [], it matches nothing useful and can be removed: on its own it either corresponds to "nothing" or, depending on the language/engine, is an invalid expression. (In Python specifically, a ] right after the opening [ is taken as a literal, so [][A-Za-z0-9] is read as a single class that also accepts ] and [, which is probably not the intention.)
And {1} means "exactly one occurrence", but by default anything in a regex without a quantifier already matches exactly one occurrence (x{1} is the same as x), so it can be removed too.
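A quick way to convince yourself that these simplifications do not change what is matched is to run both versions against the same inputs. This is only a sketch: the sample strings are made up, and the outer capturing group is left out for brevity.

```python
import re

verbose = re.compile(r'[Y][A-Za-z0-9]{6}([-][A-Za-z0-9]{1})?\s')
simple = re.compile(r'Y[A-Za-z0-9]{6}(-[A-Za-z0-9])?\s')

# Hypothetical inputs: two that should match, two that should not.
samples = ['Yabc123 x', 'Yabc123-4 x', 'Yabc12 x', 'Zabc123 x']
results_verbose = [bool(verbose.match(s)) for s in samples]
results_simple = [bool(simple.match(s)) for s in samples]
print(results_verbose)  # [True, True, False, False]
print(results_simple)   # [True, True, False, False]

# In Python, a lone [] is not a valid expression at all.
try:
    re.compile(r'[]')
    lone_empty_class_compiles = True
except re.error:
    lone_empty_class_compiles = False
print(lone_empty_class_compiles)  # False
```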
[A-Za-z0-9]{3}[A-Za-z0-9]([A-Za-z0-9])? means that [A-Za-z0-9] occurs 3 times, then once more, and optionally one more time. In other words, it occurs 4 or 5 times, so just write [A-Za-z0-9]{4,5}: the syntax {x,y} means "at least x times, and at most y times".
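The equivalence can be checked with fullmatch, which only succeeds when the whole string fits the expression. The test strings below are arbitrary examples of length 3 to 6.

```python
import re

long_form = re.compile(r'[A-Za-z0-9]{3}[A-Za-z0-9]([A-Za-z0-9])?')
short_form = re.compile(r'[A-Za-z0-9]{4,5}')

samples = ['abc', 'abcd', 'abcde', 'abcdef']  # lengths 3, 4, 5 and 6
long_results = [bool(long_form.fullmatch(s)) for s in samples]
short_results = [bool(short_form.fullmatch(s)) for s in samples]
print(long_results)   # [False, True, True, False]
print(short_results)  # [False, True, True, False]
```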
Finally, the last 3 expressions are very similar (3 specific letters followed by a digit and a space), so you could combine them into ((NME|ICR|LSR)\d ) (starts with "NME" or "ICR" or "LSR", followed by a digit and a space).
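As a sketch of that merge, the snippet below compares the three separate expressions with the combined one on a few invented lines. One side effect worth knowing: the inner group (NME|ICR|LSR) becomes group 2, so the prefix that matched can be read directly.

```python
import re

separate = re.compile('|'.join([r'(NME\d{1}[ ])', r'(ICR\d{1}[ ])', r'(LSR\d{1}[ ])']))
merged = re.compile(r'((NME|ICR|LSR)\d )')

def first_match(pattern, line):
    """Return the matched text, or None when nothing matches."""
    m = pattern.search(line)
    return m.group() if m else None

# Hypothetical lines; the last two should not match (wrong prefix, two digits).
samples = ['NME1 a', 'ICR2 b', 'LSR3 c', 'ABC4 d', 'NME12 e']
separate_results = [first_match(separate, s) for s in samples]
merged_results = [first_match(merged, s) for s in samples]
print(merged_results)  # ['NME1 ', 'ICR2 ', 'LSR3 ', None, None]

# Group 2 of the merged expression holds the three-letter prefix.
prefix = merged.search('ICR2 b').group(2)
print(prefix)  # ICR
```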
In short, it could look like this:
starts_with_Y = r'(Y[A-Za-z0-9]{6}(-[A-Za-z0-9])?\s)'
starts_with_t = r'(t[a-zA-Z][A-Za-z0-9]{4,5}\s)'
starts_with_Q = r'(Q\d{4}\s)'
starts_with_RDN = r'(\b(\w*RDN\d{1,2}-\d\w*)\b)'
starts_with_snR = r'(\b(\w*snR\d{1,3}(-[A-Za-z0-9])?\w*)\b )'
starts_with_NME_ICR_LSR = r'((NME|ICR|LSR)\d )'
r = re.compile('|'.join([starts_with_Y, starts_with_t, starts_with_Q, starts_with_RDN, starts_with_snR, starts_with_NME_ICR_LSR]))
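To see the final regex in action, here is a small self-contained run. The lines in linhas are invented for illustration (the real input format may differ), one per expression plus one that should not match.

```python
import re

starts_with_Y = r'(Y[A-Za-z0-9]{6}(-[A-Za-z0-9])?\s)'
starts_with_t = r'(t[a-zA-Z][A-Za-z0-9]{4,5}\s)'
starts_with_Q = r'(Q\d{4}\s)'
starts_with_RDN = r'(\b(\w*RDN\d{1,2}-\d\w*)\b)'
starts_with_snR = r'(\b(\w*snR\d{1,3}(-[A-Za-z0-9])?\w*)\b )'
starts_with_NME_ICR_LSR = r'((NME|ICR|LSR)\d )'
r = re.compile('|'.join([starts_with_Y, starts_with_t, starts_with_Q,
                         starts_with_RDN, starts_with_snR, starts_with_NME_ICR_LSR]))

# Hypothetical input lines (assumption: one code at the start of each line).
linhas = [
    'Yabc123 rest',
    'tAbcde1 rest',
    'Q1234 rest',
    'xRDN12-3y rest',
    'snR12 rest',
    'LSR7 rest',
    'no code here',
]

matches = []
for linha in linhas:
    m = r.search(linha)
    matches.append(m.group() if m else None)

print(matches)
# ['Yabc123 ', 'tAbcde1 ', 'Q1234 ', 'xRDN12-3y', 'snR12 ', 'LSR7 ', None]
```

The RDN expression is the only one that does not consume a trailing space, which is why its match has none.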
Perfect, very grateful.
– FourZeroFive