Short answer
Use re.split
instead of a super-regex-do-it-all.
Long answer
You can, if you want, try to write a single regex that handles everything (although I do not think that is the best solution; keep reading to see why). A first attempt would be something like:
import re
r = re.compile(r"^(PROJETO|PROTOCOLOS)\s+N\s*(º|o)\s+\d+\.\d+,?\s+DE\s+\d{1,2}\s+DE\s+\w{4,9}\s+DE\s+\d{4}.$")
I used the markers ^ (start of string) and $ (end of string) so that the string can contain only what is in the regex. Without them, the regex can match strings that have other characters before or after the text you want (if that is what you need, just remove the ^ and $).
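To make the effect of the anchors concrete, here is a small sketch. The pattern is a shortened version of the one above, and the sample text is made up for illustration:

```python
import re

# Shortened pattern, with and without the ^ and $ anchors (illustration only).
anchored = re.compile(r"^(PROJETO|PROTOCOLOS)\s+N\s*(º|o)\s+\d+\.\d+$")
unanchored = re.compile(r"(PROJETO|PROTOCOLOS)\s+N\s*(º|o)\s+\d+\.\d+")

# Hypothetical text with extra words before and after the part we care about.
texto = "VER PROJETO Nº 1.100 EM ANEXO"

anchored_hit = anchored.search(texto) is not None      # False: extra text is not allowed
unanchored_hit = unanchored.search(texto) is not None  # True: matches in the middle
print(anchored_hit, unanchored_hit)
```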
I also used \s+ for the spaces. This expression means "one or more whitespace characters", which can be the space itself, a TAB, a line break (\n), among others (see the documentation for the complete list of characters that \s matches, keeping in mind that the list may vary with the language and with how the regex is created).
If you want only the space character, you can put a literal space before the + instead of \s+, or use [ ]+, which in my opinion is a little less confusing, since you can see that there is a space inside the brackets. With something like etc +, at least for me, it is not so clear that there is a space between the c and the + (I have this problem in larger expressions, but it is a matter of taste; some people write it that way without any trouble).
The brackets define a character class, which matches any one of the characters inside it. That is, [ ] matches only the space (and not the other characters that \s matches). Using brackets for a single character is usually redundant, but in the case of the space it can improve readability, as explained above.
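The difference between \s+ and [ ]+ is easier to see when the text contains a TAB or a line break. A minimal sketch, with a made-up input string:

```python
import re

# Hypothetical input mixing a TAB, a double space and a line break.
texto = "PROJETO\tNº  1.100\nDE 28"

by_any_whitespace = re.split(r"\s+", texto)  # TAB and \n also act as separators
by_space_only = re.split(r"[ ]+", texto)     # only the space character separates

print(by_any_whitespace)  # ['PROJETO', 'Nº', '1.100', 'DE', '28']
print(by_space_only)      # ['PROJETO\tNº', '1.100\nDE', '28']
```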
This expression looks like it covers everything, but it has some problems.
You said in the comments that you want to "separate the parts to handle manually". For that, you have to put parentheses around the pieces you want to capture, so that they can be retrieved later. In the regex above there are only two pairs of parentheses: (PROJETO|PROTOCOLOS) and (º|o). This means that only these pieces are available:
r = re.compile(r"^(PROJETO|PROTOCOLOS)\s+N\s*(º|o)\s+\d+\.\d+,?\s+DE\s+\d{1,2}\s+DE\s+\w{4,9}\s+DE\s+\d{4}.$")
textos = [
"PROJETO Nº 1.100 DE 28 DE DEZEMBRO DE 2018.",
"PROJETO No 1.100 DE 28 DE DEZEMBRO DE 2018.",
"PROTOCOLOS Nº 1.100 DE 28 DE DEZEMBRO DE 2018.",
"PROTOCOLOS No 1.100 DE 28 DE DEZEMBRO DE 2018."
]
for texto in textos:
    for match in r.finditer(texto):
        print(match.groups())  # print the capture groups
The output of this code is:
('PROJETO', 'º')
('PROJETO', 'o')
('PROTOCOLOS', 'º')
('PROTOCOLOS', 'o')
Note that only the parts inside parentheses were captured. If you want the first group, just call match.group(1) (match.group(0) returns the whole text matched by the regex).
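As a quick illustration of group(0) versus the numbered groups, using the same regex and one of the sample strings:

```python
import re

r = re.compile(r"^(PROJETO|PROTOCOLOS)\s+N\s*(º|o)\s+\d+\.\d+,?\s+DE\s+\d{1,2}\s+DE\s+\w{4,9}\s+DE\s+\d{4}.$")
texto = "PROJETO Nº 1.100 DE 28 DE DEZEMBRO DE 2018."

match = r.search(texto)
whole = match.group(0)   # the entire matched text
first = match.group(1)   # first pair of parentheses
second = match.group(2)  # second pair of parentheses
print(whole, first, second, sep=" | ")
```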
If you want to capture other parts of regex, put them in parentheses as well, and they will be available on match. For example, to capture the project number and date fields:
r = re.compile(r"^(PROJETO|PROTOCOLOS)\s+N\s*(º|o)\s+(\d+\.\d+),?\s+DE\s+(\d{1,2})\s+DE\s+(\w{4,9})\s+DE\s+(\d{4}).$")
With this, the groups will be:
('PROJETO', 'º', '1.100', '28', 'DEZEMBRO', '2018')
('PROJETO', 'o', '1.100', '28', 'DEZEMBRO', '2018')
('PROTOCOLOS', 'º', '1.100', '28', 'DEZEMBRO', '2018')
('PROTOCOLOS', 'o', '1.100', '28', 'DEZEMBRO', '2018')
Other problems
Notice that I changed the month name pattern to \w{4,9} (4 to 9 characters), since FEVEREIRO has 9 letters (with {4,8} it would be left out). The problem is that \w also accepts digits and the character _, so abc_123 would be considered a valid month.
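A short check shows how permissive \w{4,9} is. The candidate strings below are made up for the test:

```python
import re

month_loose = re.compile(r"\w{4,9}")

# Hypothetical candidates: a real month, garbage with "_", digits only, and a too-short word.
candidates = ["DEZEMBRO", "abc_123", "2018", "MAR"]
accepted = [s for s in candidates if month_loose.fullmatch(s)]
print(accepted)  # ['DEZEMBRO', 'abc_123', '2018']
```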
Another point is the month of MARÇO. In Python 2, \w only matches ç if the UNICODE flag is set (see the documentation). In Python 3, ç is matched without needing the flag.
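In Python 3 you can see both behaviours side by side, because the re.ASCII flag restricts \w to ASCII letters, digits and _. I am using it here only as a rough stand-in for the Python 2 behaviour without the UNICODE flag:

```python
import re

# Python 3: str patterns are Unicode-aware by default, so Ç counts as a word character.
unicode_match = re.fullmatch(r"\w+", "MARÇO") is not None

# With re.ASCII, \w is limited to [a-zA-Z0-9_], so Ç no longer matches.
ascii_match = re.fullmatch(r"\w+", "MARÇO", flags=re.ASCII) is not None

print(unicode_match, ascii_match)  # True False
```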
It is also not clear whether the month name is always uppercase. In any case, \w accepts lowercase letters, so you might want to be more specific and use [A-ZÇ]{4,9}, for example (only the letters A to Z, or Ç). Even so, this expression still accepts strings such as ABCDEF or even ÇÇÇÇÇ.
So the best thing would be to have something more specific, like:
JANEIRO|FEVEREIRO|MARÇO|ABRIL|MAIO|JUNHO|JULHO|AGOSTO|SETEMBRO|OUTUBRO|NOVEMBRO|DEZEMBRO
Or, if you want to shorten it a little:
(JAN|FEVER)EIRO|MA(RÇ|I)O|ABRIL|JU[NL]HO|AGOSTO|(SETEM|OUTU|NOVEM|DEZEM)BRO
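If you go with the shortened version, it is worth checking that it still accepts exactly the twelve month names. A minimal sketch (the variable names are mine):

```python
import re

MESES = "JANEIRO|FEVEREIRO|MARÇO|ABRIL|MAIO|JUNHO|JULHO|AGOSTO|SETEMBRO|OUTUBRO|NOVEMBRO|DEZEMBRO"
long_form = re.compile(MESES)
short_form = re.compile(r"(JAN|FEVER)EIRO|MA(RÇ|I)O|ABRIL|JU[NL]HO|AGOSTO|(SETEM|OUTU|NOVEM|DEZEM)BRO")

names = MESES.split("|")

# Both versions should accept every real month name...
both_accept_all = all(
    long_form.fullmatch(m) is not None and short_form.fullmatch(m) is not None
    for m in names
)
# ...and reject the strings that \w{4,9} or [A-ZÇ]{4,9} would let through.
both_reject_garbage = not any(
    p.fullmatch(s) for p in (long_form, short_form) for s in ["ABCDEF", "ÇÇÇÇÇ", "abc_123"]
)
print(both_accept_all, both_reject_garbage)  # True True
```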
For the day, \d{1,2} accepts any value between 0 and 99, so it may be better to switch to something that accepts only values from 1 to 31. For example:
(part before the day) ... DE\s+(3[01]|[12]\d|[1-9])\s+DE ... (part after the day)
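To confirm that the day expression accepts exactly 1 to 31, you can test it against every number from 0 to 99:

```python
import re

day = re.compile(r"3[01]|[12]\d|[1-9]")

valid = [d for d in range(0, 100) if day.fullmatch(str(d))]
print(valid == list(range(1, 32)))  # True

# Note that a zero-padded day such as "01" is rejected by this expression.
padded_ok = day.fullmatch("01") is not None
print(padded_ok)  # False
```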
The same goes for the year, because \d{4} accepts anything from 0000 to 9999. You can switch to something like (19|20)\d{2}, which accepts years between 1900 and 2099, or use something even more elaborate to restrict the range further.
As for the protocol/project number: does it always use the period as the thousands separator? Can there be numbers smaller than 1000 (and therefore without the period)? I assume it cannot start with zero (01.200 would be invalid, right?). In any case, one option would be this regex:
[1-9](([0-9]{0,2}(\.[0-9]{3})*)|[0-9]*)
It accepts values such as 1, 123, 1.100 and 1100, and rejects values starting with zero (such as 01 and 01.200). You can refine it further by restricting it to the numbers that are valid in your use cases (for example, if there are no values below 1000, or if values may start with zero; if the thousands separator is mandatory, remove the |[0-9]* part; and so on).
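A quick test of the number expression against a few candidates. The last three are extra cases I added to show what gets rejected:

```python
import re

numero = re.compile(r"[1-9](([0-9]{0,2}(\.[0-9]{3})*)|[0-9]*)")

candidates = ["1", "123", "1.100", "1100", "1.100.200", "01", "01.200", "1.10"]
accepted = [c for c in candidates if numero.fullmatch(c)]
print(accepted)  # ['1', '123', '1.100', '1100', '1.100.200']
```

fullmatch is used so that the whole candidate has to fit the expression; inside a larger regex, the surrounding parts play that role.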
Now just put it all together
Well, now just join all the above parts in a single regex that validates everything. Choose whether the separator will be \s+
or [ ]+
, put parentheses around the groups you want to capture and you’re done.
If you do not want a pair of parentheses to capture, just add ?: right after the opening parenthesis so that it becomes a non-capturing group. For example, if the word PROJETO or PROTOCOLOS does not interest me, I can switch to (?:PROJETO|PROTOCOLOS), and then that part is not returned by the groups() method.
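The effect of ?: on groups() can be seen with a shortened pattern (only the first part of the full regex, for illustration):

```python
import re

capturing = re.compile(r"(PROJETO|PROTOCOLOS)\s+N\s*(º|o)\s+(\d+\.\d+)")
non_capturing = re.compile(r"(?:PROJETO|PROTOCOLOS)\s+N\s*(?:º|o)\s+(\d+\.\d+)")

texto = "PROJETO Nº 1.100"

with_groups = capturing.search(texto).groups()
without_groups = non_capturing.search(texto).groups()
print(with_groups)     # ('PROJETO', 'º', '1.100')
print(without_groups)  # ('1.100',)
```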
But wait, do you want to validate the date too? After all, the regex may accept only days between 1 and 31, but what about 31 de ABRIL (which someone may have mistyped)? April has only 30 days, so the regex should not accept it. The same goes for 29 February, which is valid only in leap years. There are regexes that validate leap years, but they are so complicated that they are not worth it...
Of course, you could just use (\d{4}) to capture the year, then convert it to a number and check whether it is a leap year. That is actually the easiest way, since it takes only a few simple calculations.
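The "simple calculations" are the usual Gregorian rule. A sketch, where is_leap is a name I made up (the standard library already offers calendar.isleap, used below only as a cross-check):

```python
import calendar

def is_leap(year: int) -> bool:
    """Gregorian rule: divisible by 4, except centuries not divisible by 400."""
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)

ano = int("2018")  # the text captured by (\d{4}), converted to a number
print(is_leap(ano))  # False

# Cross-check against the standard library for a range of years.
agrees = all(is_leap(y) == calendar.isleap(y) for y in range(1900, 2100))
print(agrees)  # True
```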
The same goes for the other fields (month, project/protocol number): it is easier to extract whatever is there and validate it outside the regex. And since we are going to validate outside the regex anyway, why try to build a super precise and complicated expression?
As such, we can take this idea of "extracting what you have and validating outside of regex" to the extreme by simply making a split
. As the parts you want are separated by spaces, just do:
texto = "PROJETO Nº 1.100 DE 28 DE DEZEMBRO DE 2018."
# split the string on whitespace
partes = re.split(r"\s+", texto)
I used \s+
to separate the string by spaces (one or more spaces).
Again, you can switch to [ ]+ if you want only the space character (and do not want to treat TAB, line breaks, etc. as separators).
With this, I get a list of the parts of the string.
Then I can see if I have all the parts (using len
to check the size of the list), and from there, I can validate each part separately.
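For the sample string, the list looks like this. Note that the last part keeps the final period, which has to be handled when validating the year:

```python
import re

texto = "PROJETO Nº 1.100 DE 28 DE DEZEMBRO DE 2018."
partes = re.split(r"\s+", texto)

print(partes)
# ['PROJETO', 'Nº', '1.100', 'DE', '28', 'DE', 'DEZEMBRO', 'DE', '2018.']
print(len(partes))  # 9
```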
Each part is validated according to its own rules. That can be done with (simpler) regexes or with other methods, depending on the case. For example:
if len(partes) == 9:  # there are 9 parts, OK
    # there are only 2 options, so two comparisons are simpler than a regex
    if partes[0] == 'PROJETO' or partes[0] == 'PROTOCOLOS':
        # first part OK
        ...  # etc.
And so on. For dates, you can use the datetime module to validate them, for example (but that is beyond the scope of the question, so I leave it as an "exercise for the reader").
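Just to give an idea of the direction, here is one possible sketch with datetime. The MESES dictionary and the parse_data function are names I invented for the example, and it assumes the month is written out in full:

```python
from datetime import date

# Assumed mapping from the month names used in the texts to month numbers.
MESES = {
    "JANEIRO": 1, "FEVEREIRO": 2, "MARÇO": 3, "ABRIL": 4,
    "MAIO": 5, "JUNHO": 6, "JULHO": 7, "AGOSTO": 8,
    "SETEMBRO": 9, "OUTUBRO": 10, "NOVEMBRO": 11, "DEZEMBRO": 12,
}

def parse_data(dia: str, mes: str, ano: str):
    """Return a date, or None if the parts do not form a real calendar date."""
    try:
        return date(int(ano), MESES[mes.upper()], int(dia))
    except (KeyError, ValueError):
        # KeyError: unknown month name; ValueError: not a number or impossible date
        return None

print(parse_data("28", "DEZEMBRO", "2018"))   # 2018-12-28
print(parse_data("31", "ABRIL", "2018"))      # None (April has 30 days)
print(parse_data("29", "FEVEREIRO", "2018"))  # None (2018 is not a leap year)
```

Since date() itself rejects impossible dates, there is no need to write the rules for month lengths and leap years by hand.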
Of course, I could also settle for a compromise: a regex that is not so precise but validates the basic format. Something like the first version above, but with additional capture groups to extract the information that is harder to validate (date, number), which I then validate outside the regex.
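A rough sketch of that compromise: the regex only checks the general shape and captures the fields, and the stricter checks happen in ordinary code. The function name extract and the returned dictionary are my own choices, and only the day is checked here to keep it short:

```python
import re

basic = re.compile(
    r"^(PROJETO|PROTOCOLOS)\s+N\s*(?:º|o)\s+(\d+\.\d+),?\s+DE\s+"
    r"(\d{1,2})\s+DE\s+(\w{4,9})\s+DE\s+(\d{4}).$"
)

def extract(texto: str):
    """Return the captured fields as a dict, or None if the text is not acceptable."""
    m = basic.search(texto)
    if m is None:
        return None
    tipo, numero, dia, mes, ano = m.groups()
    dia_int = int(dia)
    if not 1 <= dia_int <= 31:  # validation done outside the regex
        return None
    return {
        "tipo": tipo,
        "numero": int(numero.replace(".", "")),  # "1.100" -> 1100
        "dia": dia_int,
        "mes": mes,
        "ano": int(ano),
    }

ok = extract("PROJETO Nº 1.100 DE 28 DE DEZEMBRO DE 2018.")
bad_day = extract("PROJETO Nº 1.100 DE 99 DE DEZEMBRO DE 2018.")
print(ok)
print(bad_day)  # None
```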
But since most validations (by the looks of it) are easier to do outside the regex, I think the split
ends up being the easiest solution.
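Putting the split idea together, a possible end-to-end sketch could look like the one below. The function name validar and the exact rules are assumptions on my part (for instance, it assumes "Nº" is written without a space in the middle, otherwise the number of parts changes), so adapt them to your data:

```python
import re

NUMERO = re.compile(r"[1-9](([0-9]{0,2}(\.[0-9]{3})*)|[0-9]*)")

def validar(texto: str) -> bool:
    # str.split() with no arguments splits on runs of whitespace,
    # like re.split(r"\s+", ...), but ignores leading/trailing whitespace.
    partes = texto.split()
    if len(partes) != 9:
        return False
    tipo, n, numero, de1, dia, de2, mes, de3, ano = partes
    return (
        tipo in ("PROJETO", "PROTOCOLOS")
        and n in ("Nº", "No")
        and NUMERO.fullmatch(numero.rstrip(",")) is not None  # optional comma
        and (de1, de2, de3) == ("DE", "DE", "DE")
        and dia.isdecimal() and 1 <= int(dia) <= 31
        and mes.isalpha()
        and ano.rstrip(".").isdecimal()  # optional final period
    )

print(validar("PROJETO Nº 1.100 DE 28 DE DEZEMBRO DE 2018."))   # True
print(validar("PROJETO  No   1.100,  DE 28 DE MARÇO DE 2018."))  # True
print(validar("PROJETO Nº 01.200 DE 28 DE DEZEMBRO DE 2018."))  # False
```

Each condition is short and can be tightened on its own (for example, checking the month against a list of names) without touching the others.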
Another point: in the question you mention the word PROJETOS (plural), but the examples use PROJETO (singular). If both plural and singular are possible, you can use (PROJETO|PROTOCOLO)S? (the S? indicates that the letter "S" is optional).
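A quick check of the optional "S" (the last candidate is an extra case to show that only one "S" is allowed):

```python
import re

tipo = re.compile(r"(PROJETO|PROTOCOLO)S?")

candidates = ["PROJETO", "PROJETOS", "PROTOCOLO", "PROTOCOLOS", "PROJETOSS"]
accepted = [w for w in candidates if tipo.fullmatch(w)]
print(accepted)  # ['PROJETO', 'PROJETOS', 'PROTOCOLO', 'PROTOCOLOS']
```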
Anyway, the general idea does not change, just change the regex and/or the validation according to what you need.
Why are there several repeated spaces in your regular expression?
– Woss
The information about Projects and Protocols is in a database and, unfortunately, is not standardized. Since there is a lot of it, I am writing handling for the exceptions. And thanks for the edit, it helped a lot!!
– Igor Gabriel
Is regular expression ideal for this? With it you will need to define all variations and apparently this is exactly what you want to avoid.
– Woss
The regular expression is being used to find the acceptable variations and separate the parts to handle manually.
– Igor Gabriel
Try this regex:
(PROJETO|PROTOCOLOS)\s+N\s*(º|o)\s+\d+\.\d+,? {1,2}DE {1,2}\d{1,2} {1,2}DE {1,2}\w{4,8} {1,2}DE {1,2}\d{4}.
and the demo.
– danieltakeshi
Maybe it is easier to do re.split(r'\s+', texto) to separate the text by spaces: \s+ is one or more characters that count as "spaces" (it actually also includes TAB, line break and others; see the documentation). That way you get a list of the parts of the text and can validate each one individually (each with its specific rules). I think it would be clearer than a super-regex-that-validates-everything.
– hkotsubo