Regular expression that finds patterns in which some terms can change?

I need to find a pattern within a text. Example:

PROJETO  Nº 1.100  DE 28 DE DEZEMBRO DE 2018.

Within this pattern, the word PROJETOS may also be PROTOCOLOS and the character º can be o. Thus, the expression must find:

PROJETO  Nº 1.100  DE 28 DE DEZEMBRO DE 2018.
PROJETO  No 1.100  DE 28 DE DEZEMBRO DE 2018.
PROTOCOLOS  Nº 1.100  DE 28 DE DEZEMBRO DE 2018.
PROTOCOLOS  No 1.100  DE 28 DE DEZEMBRO DE 2018.

I tried this way:

re = r"(PROJETO|PROTOCOLOS)\s+N\s+\w\s+(\d\.?)+,?  DE  \d{1,2}  DE  \w{4,8}  DE  \d{4}."

Note: the parts of the date may be separated by one or more spaces.

  • Why are there several duplicated spaces in your regular expression?

  • The project and protocol information comes from a database and unfortunately is not standardized. Since there is a lot of it, I’m building exception handling for it. And thanks for the edit, it helped a lot!

  • Is a regular expression ideal for this? With it you will need to define all the variations, and apparently that is exactly what you want to avoid.

  • Regular expression is being used to find acceptable variations and separate radicals to treat manually.

  • Try this regex: (PROJETO|PROTOCOLOS)\s+N\s*(º|o)\s+\d+\.\d+,? {1,2}DE {1,2}\d{1,2} {1,2}DE {1,2}\w{4,8} {1,2}DE {1,2}\d{4}. and the demo

  • Maybe it’s easier to do re.split(r'\s+', texto) to separate the text by spaces -> \s+ is one or more characters that count as "spaces" (it actually also includes TAB, line breaks, and others, see the documentation). That way you have a list of the parts of the text and can validate each one individually (each with its own rules). I think it would be clearer than a super-regex-validating-all.

2 answers

4


Short answer

Use re.split instead of a super-regex-do-it-all.

Long answer

You can, if you want, try to make a single regex that solves everything (although I don’t think it’s the best solution, keep reading and understand the reasons). But anyway, a first attempt would be something like:

import re

r = re.compile(r"^(PROJETO|PROTOCOLOS)\s+N\s*(º|o)\s+\d+\.\d+,?\s+DE\s+\d{1,2}\s+DE\s+\w{4,9}\s+DE\s+\d{4}\.$")

I used the anchors ^ (start of string) and $ (end of string) so that the string can contain only what the regex describes. Without them, the regex can also match strings that have characters before or after the text you want (if that is what you need, just remove the ^ and $).
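To make the effect of the anchors concrete, here is a minimal sketch. The padded sample string and the variable names are invented for the illustration, and the final dot is escaped so that it only matches a literal period.

```python
import re

# Same pattern with and without the ^ and $ anchors.
core = (r"(PROJETO|PROTOCOLOS)\s+N\s*(º|o)\s+\d+\.\d+,?\s+DE\s+\d{1,2}"
        r"\s+DE\s+\w{4,9}\s+DE\s+\d{4}\.")
anchored = re.compile("^" + core + "$")
unanchored = re.compile(core)

exact = "PROJETO Nº 1.100 DE 28 DE DEZEMBRO DE 2018."
padded = "Ver o PROJETO Nº 1.100 DE 28 DE DEZEMBRO DE 2018. em anexo"  # made-up text

anchored_exact = anchored.search(exact) is not None        # whole string fits
anchored_padded = anchored.search(padded) is not None      # extra text blocks the match
unanchored_padded = unanchored.search(padded) is not None  # found inside the text

print(anchored_exact, anchored_padded, unanchored_padded)
```

With the anchors, only the string that contains nothing else matches; without them, the pattern is also found inside the longer string.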

I also use \s+ for the spaces. This expression means "one or more spaces", where "space" can be the space character itself, a TAB, a line break (\n), among others (see the documentation for the complete list of characters that \s matches, keeping in mind that this list may vary with the language and with the way the regex is created).

If you want only the space character, you can put a literal space before the + instead of \s, or use [ ]+, which in my opinion is a little less confusing because you can see that there is a space inside the brackets. With something like etc +, at least for me, it is not so clear that there is a space between the c and the + (I have this problem in larger expressions, but it is a matter of taste, and some people write it that way without trouble).

The brackets define a character class, which matches any one of the characters inside them. So [ ] matches only the space character (and not the other characters that \s matches). Brackets around a single character are usually redundant, but for the space they can improve readability, as explained above.
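The difference between \s+ and [ ]+ can be checked with a small sketch like the one below (Python 3; the sample strings and names are just for illustration):

```python
import re

any_whitespace = re.compile(r"DE\s+28")  # spaces, TABs, line breaks, ...
only_spaces = re.compile(r"DE[ ]+28")    # only the space character

samples = {
    "spaces": "DE   28",
    "tab": "DE\t28",
    "newline": "DE\n28",
}

# For each sample: (matched by \s+, matched by [ ]+)
results = {
    name: (bool(any_whitespace.fullmatch(s)), bool(only_spaces.fullmatch(s)))
    for name, s in samples.items()
}
print(results)
```

Both versions accept several consecutive spaces, but only \s+ accepts the TAB and the line break.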


This expression looks like it covers everything, but it has some problems.

You said in the comments that you want to "separate the radicals to handle manually". In that case you have to put parentheses around the parts you want to capture, so that they can be retrieved later. In the regex above there are only two pairs of parentheses: (PROJETO|PROTOCOLOS) and (º|o). This means that only these parts will be available:

import re

r = re.compile(r"^(PROJETO|PROTOCOLOS)\s+N\s*(º|o)\s+\d+\.\d+,?\s+DE\s+\d{1,2}\s+DE\s+\w{4,9}\s+DE\s+\d{4}\.$")
textos = [
    "PROJETO Nº 1.100 DE 28 DE DEZEMBRO DE 2018.",
    "PROJETO No 1.100 DE 28 DE DEZEMBRO DE 2018.",
    "PROTOCOLOS Nº 1.100 DE 28 DE DEZEMBRO DE 2018.",
    "PROTOCOLOS No 1.100 DE 28 DE DEZEMBRO DE 2018."
]
for texto in textos:
    for match in r.finditer(texto):
        print(match.groups())  # print the capture groups

The output of this code is:

('PROJETO', 'º')
('PROJETO', 'o')
('PROTOCOLOS', 'º')
('PROTOCOLOS', 'o')

Note that only the parts inside parentheses were captured. If you want just the first group, use match.group(1) (match.group(0) returns the entire text matched by the regex).
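A short sketch of group(0) versus the numbered groups, using the same pattern (with the final dot escaped) and one of the sample strings; the variable names are my own:

```python
import re

r = re.compile(r"^(PROJETO|PROTOCOLOS)\s+N\s*(º|o)\s+\d+\.\d+,?\s+DE\s+\d{1,2}"
               r"\s+DE\s+\w{4,9}\s+DE\s+\d{4}\.$")
match = r.search("PROTOCOLOS No 1.100 DE 28 DE DEZEMBRO DE 2018.")

whole = match.group(0)   # everything the regex matched
first = match.group(1)   # first pair of parentheses
second = match.group(2)  # second pair of parentheses

print(whole)
print(first, second)
```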

If you want to capture other parts of regex, put them in parentheses as well, and they will be available on match. For example, to capture the project number and date fields:

r = re.compile(r"^(PROJETO|PROTOCOLOS)\s+N\s*(º|o)\s+(\d+\.\d+),?\s+DE\s+(\d{1,2})\s+DE\s+(\w{4,9})\s+DE\s+(\d{4})\.$")

With this, the groups will be:

('PROJETO', 'º', '1.100', '28', 'DEZEMBRO', '2018')
('PROJETO', 'o', '1.100', '28', 'DEZEMBRO', '2018')
('PROTOCOLOS', 'º', '1.100', '28', 'DEZEMBRO', '2018')
('PROTOCOLOS', 'o', '1.100', '28', 'DEZEMBRO', '2018')

Other problems

Notice that I changed the month name to \w{4,9} (4 to 9 characters), since FEVEREIRO (February) has 9 letters (with {4,8} it would be left out). The problem is that \w also accepts digits and the character _, so abc_123 would be considered a valid month.

Another point is the month MARÇO (March). In Python 2, \w only matches ç if the UNICODE flag is set (see the documentation and an example). In Python 3, ç is matched without needing the flag (example).

It is also not clear whether the month name is always uppercase. In any case, \w accepts lowercase letters, so you might want to be more specific and use [A-ZÇ]{4,9}, for example (only the letters A to Z or Ç). Even so, this expression still accepts strings such as ABCDEF or even ÇÇÇÇÇ (see here).
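These two cases are easy to check with a quick sketch. It assumes Python 3, where \w is Unicode-aware by default; the candidate list and names are only for illustration.

```python
import re

loose_month = re.compile(r"\w{4,9}")      # letters, digits and underscore
upper_month = re.compile(r"[A-ZÇ]{4,9}")  # only A-Z and Ç

candidates = ["DEZEMBRO", "MARÇO", "abc_123", "ABCDEF", "ÇÇÇÇÇ"]

loose_ok = [c for c in candidates if loose_month.fullmatch(c)]
upper_ok = [c for c in candidates if upper_month.fullmatch(c)]

print(loose_ok)
print(upper_ok)
```

The stricter class drops abc_123 but still lets through strings that are not month names.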

So the best thing would be to have something more specific, like:

JANEIRO|FEVEREIRO|MARÇO|ABRIL|MAIO|JUNHO|JULHO|AGOSTO|SETEMBRO|OUTUBRO|NOVEMBRO|DEZEMBRO

Or, if you want to shorten it a little:

(JAN|FEVER)EIRO|MA(RÇ|I)O|ABRIL|JU[NL]HO|AGOSTO|(SETEM|OUTU|NOVEM|DEZEM)BRO
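If in doubt about the shortened version, a sketch like this one can confirm that both forms accept the same twelve names (the list of non-months is just a handful of examples I picked):

```python
import re

months = ["JANEIRO", "FEVEREIRO", "MARÇO", "ABRIL", "MAIO", "JUNHO",
          "JULHO", "AGOSTO", "SETEMBRO", "OUTUBRO", "NOVEMBRO", "DEZEMBRO"]

long_form = re.compile("|".join(months))
short_form = re.compile(
    r"(JAN|FEVER)EIRO|MA(RÇ|I)O|ABRIL|JU[NL]HO|AGOSTO|(SETEM|OUTU|NOVEM|DEZEM)BRO"
)

# Every real month name must be accepted by both forms.
all_months_match = all(
    long_form.fullmatch(m) and short_form.fullmatch(m) for m in months
)

# None of these should be accepted by the shortened form.
wrongly_accepted = [
    s for s in ["ABCDEF", "ÇÇÇÇÇ", "abc_123", "MARCO"] if short_form.fullmatch(s)
]

print(all_months_match, wrongly_accepted)
```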

For the day, \d{1,2} accepts any value between 0 and 99, so maybe it is better to switch to something that accepts only values from 1 to 31. For example:

(part before the day) ... DE\s+(3[01]|[12]\d|[1-9])\s+DE ... (part after the day)

The same goes for the year, because \d{4} accepts anything from 0000 to 9999. You can switch to something like (19|20)\d{2}, which accepts years between 1900 and 2099, or use something even more complicated to restrict the range further.
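A quick way to see which values these two sub-patterns accept is to run them over a range of numbers. This is only a sketch, and the sample years are arbitrary:

```python
import re

day = re.compile(r"3[01]|[12]\d|[1-9]")
year = re.compile(r"(19|20)\d{2}")

# Try every number from 0 to 99 as a day.
valid_days = [d for d in range(0, 100) if day.fullmatch(str(d))]

# Try a few years around the edges of the accepted range.
valid_years = [y for y in (1899, 1900, 2018, 2099, 2100) if year.fullmatch(str(y))]

# Days written with a leading zero are not covered by this pattern.
leading_zero_ok = day.fullmatch("05") is not None

print(valid_days)
print(valid_years, leading_zero_ok)
```

Note that a day written as 05 is rejected, so the pattern needs adjusting if the data can contain leading zeros.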

As for the protocol/project number: does it always have the dot as thousands separator? Can there be numbers smaller than 1000 (and therefore without the dot)? I assume it cannot start with zero (01.200 would be invalid, right?). Anyway, one option would be this regex:

[1-9](([0-9]{0,2}(\.[0-9]{3})*)|[0-9]*)

It accepts values such as 1, 123, 1.100 and 1100 and rejects values starting with zero (such as 01 and 01.200). You can refine it further by restricting it to the numbers that are valid in your use cases: for example, if there are no values below 1000, if values may start with zero, or if the thousands separator is required (in which case remove the |[0-9]* part).
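The behavior described above can be verified with a small sketch. The extra test values and the name strict (for the variant without |[0-9]*) are my own additions:

```python
import re

number = re.compile(r"[1-9](([0-9]{0,2}(\.[0-9]{3})*)|[0-9]*)")
# Variant that requires the thousands separator (the |[0-9]* part removed).
strict = re.compile(r"[1-9][0-9]{0,2}(\.[0-9]{3})*")

accepted = [s for s in ["1", "123", "1.100", "1100", "12.345.678"]
            if number.fullmatch(s)]
rejected = [s for s in ["01", "01.200", "0", "1.10", "1.1000"]
            if not number.fullmatch(s)]

strict_plain = strict.fullmatch("1100") is not None    # no separator
strict_dotted = strict.fullmatch("1.100") is not None  # with separator

print(accepted)
print(rejected)
print(strict_plain, strict_dotted)
```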


Now just put it all together

Well, now just join all the above parts in a single regex that validates everything. Choose whether the separator will be \s+ or [ ]+, put parentheses around the groups you want to capture and you’re done.

If you don’t want a pair of parentheses to capture anything, just add ?: right after the opening parenthesis so that it becomes a non-capturing group. For example, if the word PROJETO or PROTOCOLOS does not interest me, I can switch to (?:PROJETO|PROTOCOLOS), and that part will no longer be returned by the groups() method.
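As a rough sketch of what joining the parts could look like, the code below builds one regex from the pieces discussed so far. It is only one possible combination: it assumes \s+ as the separator, keeps the kind, number, day, month and year as capture groups, and makes every other pair of parentheses non-capturing. The piece names (kind, ordinal, sep and so on) are my own.

```python
import re

kind = r"(PROJETO|PROTOCOLOS)"
ordinal = r"N\s*(?:º|o)"  # non-capturing: not returned by groups()
number = r"([1-9](?:[0-9]{0,2}(?:\.[0-9]{3})*|[0-9]*))"
day = r"(3[01]|[12]\d|[1-9])"
month = (r"(JANEIRO|FEVEREIRO|MARÇO|ABRIL|MAIO|JUNHO|JULHO|AGOSTO"
         r"|SETEMBRO|OUTUBRO|NOVEMBRO|DEZEMBRO)")
year = r"((?:19|20)\d{2})"
sep = r"\s+"

# Join the pieces with the separator, then add the anchors and the final period.
pieces = [kind, ordinal, number + ",?", "DE", day, "DE", month, "DE", year]
full = re.compile("^" + sep.join(pieces) + r"\.$")

good = full.search("PROTOCOLOS  No 1.100  DE 28 DE DEZEMBRO DE 2018.")
bad_day = full.search("PROJETO Nº 1.100 DE 32 DE DEZEMBRO DE 2018.")

print(good.groups())
print(bad_day)
```

Only the five capturing groups show up in groups(); the º|o alternative is matched but not returned.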

But wait, do you want to validate the date too? After all, the regex may accept only days between 1 and 31, but what if the text has 31 DE ABRIL (which someone may have typed by mistake)? April has only 30 days, so the regex should not accept it. The same goes for February 29, which is only valid in leap years. Here is a regex that validates leap years, and you can see that it is so complicated that it is not worth the effort...

Of course, you could just use (\d{4}) to capture the year, convert it to a number, and then check whether it is a leap year. In fact, that is the easiest way, since it only takes a few simple calculations.

The same goes for the other fields (month, project/protocol number): it is easier to extract whatever is there and validate it outside the regex. And since we are going to validate outside the regex anyway, why try to build a super accurate and complicated expression?
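The "simple calculations" could look like the sketch below. The function name is_leap and the sample text are mine; the rule is the usual Gregorian one (divisible by 4, except centuries not divisible by 400).

```python
import re


def is_leap(year):
    """Gregorian leap-year rule."""
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)


texto = "PROJETO Nº 1.100 DE 29 DE FEVEREIRO DE 2018."  # made-up example

# Capture the four digits just before the final period and convert to int.
year = int(re.search(r"(\d{4})\.$", texto).group(1))
feb_29_is_valid = is_leap(year)

print(year, feb_29_is_valid)
```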

As such, we can take this idea of "extracting what you have and validating outside of regex" to the extreme by simply making a split. As the parts you want are separated by spaces, just do:

import re

texto = "PROJETO Nº 1.100 DE 28 DE DEZEMBRO DE 2018."
# split the string on whitespace
partes = re.split(r"\s+", texto)

I used \s+ to separate the string by spaces (one or more spaces).
Again, you can exchange for [ ]+ if you only want the white space (and do not want to consider the TAB, line breaking, etc).

With this, I get a list of the parts of the string. Then I can see if I have all the parts (using len to check the size of the list), and from there, I can validate each part separately.

Each part is validated separately, according to its own rules. That can be done with a regex (a simpler one) or with other methods, depending on the case. For example:

if len(partes) == 9:  # there are 9 parts, OK
    # with only 2 options, two comparisons are simpler than a regex
    if partes[0] == 'PROJETO' or partes[0] == 'PROTOCOLOS':
        pass  # first part OK
    # ... and so on for the remaining parts

And so on. For dates, you can use the datetime module to validate them, for example (but this is already beyond the scope of the question, so I leave it as an "exercise for the reader").
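For anyone who wants a starting point for that exercise, here is one possible sketch of the split-based approach with the date checked by datetime.date. The function name parse_line, the MONTHS table and the exact rules for each part are my own assumptions, so adjust them to your data.

```python
import re
from datetime import date

MONTHS = {
    "JANEIRO": 1, "FEVEREIRO": 2, "MARÇO": 3, "ABRIL": 4, "MAIO": 5,
    "JUNHO": 6, "JULHO": 7, "AGOSTO": 8, "SETEMBRO": 9, "OUTUBRO": 10,
    "NOVEMBRO": 11, "DEZEMBRO": 12,
}


def parse_line(texto):
    """Return (kind, number, date) or None if any part fails validation."""
    partes = re.split(r"\s+", texto.strip())
    if len(partes) != 9:
        return None
    kind, ordinal, number, de1, day, de2, month, de3, year = partes
    if kind not in ("PROJETO", "PROTOCOLOS"):
        return None
    if ordinal not in ("Nº", "No"):
        return None
    if (de1, de2, de3) != ("DE", "DE", "DE"):
        return None
    # Number: the regex from the answer, plus an optional trailing comma.
    if not re.fullmatch(r"[1-9](([0-9]{0,2}(\.[0-9]{3})*)|[0-9]*),?", number):
        return None
    if month not in MONTHS:
        return None
    if not re.fullmatch(r"[0-9]{1,2}", day):
        return None
    if not re.fullmatch(r"[0-9]{4}\.", year):
        return None
    try:
        # date() raises ValueError for impossible dates such as April 31.
        when = date(int(year[:-1]), MONTHS[month], int(day))
    except ValueError:
        return None
    return kind, number.rstrip(","), when


print(parse_line("PROJETO  Nº 1.100  DE 28 DE DEZEMBRO DE 2018."))
print(parse_line("PROJETO Nº 1.100 DE 31 DE ABRIL DE 2018."))
```

Letting date() reject impossible dates covers both the 31 DE ABRIL case and February 29 in non-leap years without any extra regex.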


Of course, I could also make a compromise: a regex that is not so precise but validates the basic format. Something like the first version I showed at the beginning, but with some additional capture groups to extract the information that is harder to validate (date, number), and then I do the validation outside the regex.

But since most validations (by the looks of it) are easier to do outside the regex, I think the split ends up being the easiest solution.

Another point is that the question mentions the word PROJETOS (plural), but the examples have the word PROJETO (singular). If both plural and singular are possible, you can use (PROJETO|PROTOCOLO)S? (the S? indicates that the letter "S" is optional).
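A minimal sketch of the optional S (the names are just for the example). Note that the S sits outside the parentheses, so the captured group never includes it:

```python
import re

kind = re.compile(r"(PROJETO|PROTOCOLO)S?")

words = ["PROJETO", "PROJETOS", "PROTOCOLO", "PROTOCOLOS"]
accepted = [w for w in words if kind.fullmatch(w)]

# The group holds only what is inside the parentheses.
captured = kind.fullmatch("PROTOCOLOS").group(1)

print(accepted)
print(captured)
```

If you need the plural form in the captured text, move the S? inside the parentheses or wrap the whole thing in another group.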

Anyway, the general idea does not change, just change the regex and/or the validation according to what you need.

3

Try this regex: (PROJETO|PROTOCOLOS) *?N(º|o) *?[0-9.]+ *DE *\d+ DE \b.*\b DE \d*\. It needs to find every match (what other regex flavors call the global flag), which in Python is what the method regex.finditer(string) does by default.

So your code would look like this:

import re

test_str = ("PROJETO Nº 1.100 DE 28 DE DEZEMBRO DE 2018.\n"
"PROJETO No 1.100 DE 28 DE DEZEMBRO DE 2018.\n"
"PROTOCOLOS Nº 1.100 DE 28 DE DEZEMBRO DE 2018.\n"
"PROTOCOLOS No 1.100 DE 28 DE DEZEMBRO DE 2018.")
pattern = r'(PROJETO|PROTOCOLOS) *?N(º|o) *?[0-9.]+ *DE *\d+ DE \b.*\b DE \d*\.'
regex = re.compile(pattern)
for match in regex.finditer(test_str):
    print(match.group(0))

Explanation by Regex:

  • (PROJETO|PROTOCOLOS) - Captures the sequence PROJETO or the sequence PROTOCOLOS.

  • " *?" - Matches zero or more spaces, as few as possible (the space before the * is hard to see with code formatting).

  • N(º|o) - Matches N followed by º or o.

  • " *?" - Matches zero or more spaces, as few as possible.

  • [0-9.]+ - Matches a run of one or more digits and dots (.).
  • " *" - Matches zero or more spaces.
  • DE - Matches exactly the sequence "DE".
  • " *" - Matches zero or more spaces.
  • \d+ - Matches 1 or more digits.
  • " DE " - Matches exactly the sequence " DE ".
  • \b.*\b - Matches any run of characters that starts and ends at a word boundary; here it is meant to match the month name.
  • " DE " - Matches exactly the sequence " DE ".
  • \d* - Matches zero or more digits.
  • \. - Matches the final period of the sentence.

You can also see how it works and test other regex patterns here.

Note: the regex is as generic as possible. If you find a case with an unexpected result, let me know which one so I can make the regex more precise.
