How do I find a character pattern corresponding to a date in a text?

12

I have some text in a string, and I want to use a method such as .find to locate a substring in the format "dd.mm.yyyy".

I thought of using .find("xx.xx.xxxx"), but I don’t know what to put in place of "x" so that it matches any digit.

Is the best approach to read up on regular expressions?

2 answers

20


To search for patterns in text, a good solution is to use regular expressions. In Python, they are available in the re module.

To match the digits, you can use the shortcut \d (which matches any digit from 0 to 9¹). To limit the quantity, you can use quantifiers such as {2} and {4}, meaning, respectively, "exactly 2 occurrences" and "exactly 4 occurrences". That is, \d{2} means "exactly 2 digits from 0 to 9".

So the regex goes like this:

import re

texto = """Data 01.02.2019, outra data 20.11.2018 etc...
Outra data 15.03.1980, etc
"""

r = re.compile(r'\d{2}\.\d{2}\.\d{4}')
print(r.findall(texto))

findall returns a list of the substrings found in the string. In this case, that is any substring that matches \d{2}\.\d{2}\.\d{4} (two digits, dot, two digits, dot, four digits):

['01.02.2019', '20.11.2018', '15.03.1980']

Note that the dot was written as \.. This is necessary because the dot has a special meaning in regex: it means "any character" (except line breaks). That is, if the regex were \d{2}.\d{2}.\d{4}, it would also match things like 12-10#2018 or even 12a1092018 (the dot matches any character, including letters and digits).

For the dot to "lose its powers" and be interpreted as an ordinary character, it must be escaped with \. So \. matches only the character ., with no special meaning.
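A small sketch of that difference, using the two non-date strings mentioned above (the variable names are only for illustration):

```python
import re

# Neither of these strings is a date in the dd.mm.yyyy format.
samples = ["12-10#2018", "12a1092018"]

unescaped = re.compile(r'\d{2}.\d{2}.\d{4}')   # . means "any character"
escaped = re.compile(r'\d{2}\.\d{2}\.\d{4}')   # \. means a literal dot

for s in samples:
    # fullmatch succeeds only if the whole string matches the pattern
    print(s, bool(unescaped.fullmatch(s)), bool(escaped.fullmatch(s)))
```

Both strings are accepted by the unescaped version and rejected by the escaped one.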


If you want, you can use finditer, which returns an iterator of match objects that can be used to obtain more information about the parts found:

for match in r.finditer(texto):
    print("date '{}' found in position {}".format(match.group(), match.start()))

In the example above I used the method group() to obtain the exact substring that was found (in this case, the date) and start(), which returns the position in the string at which the date was found:

date '01.02.2019' found in position 5
date '20.11.2018' found in position 28
date '15.03.1980' found in position 57

See the documentation for more details on the information that can be obtained from match.
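As one more illustration of what a match object offers, the sketch below uses span(), which returns the start and end positions together. The shortened text and the names inicio and fim are just for this example:

```python
import re

texto = "Data 01.02.2019, outra data 20.11.2018"
r = re.compile(r'\d{2}\.\d{2}\.\d{4}')

for match in r.finditer(texto):
    # span() returns (start, end); end is one past the last matched character,
    # so it can be used directly to slice the original string
    inicio, fim = match.span()
    print(match.group(), inicio, fim, texto[inicio:fim])
```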

We could stop here, but since it is not clear what your text may contain, I think we can improve this regex a little more.


Limiting the accepted values

\d matches any digit from 0 to 9, which means that \d{2} will accept values such as 00, 32 and 99. But these are not valid values for days and months, so we can change the regex to limit the accepted values:

r = re.compile(r'(0[1-9]|[12]\d|3[01])\.(0[1-9]|1[0-2])\.(19|20)\d{2}')

Here we have the use of alternation (the character |, which means "or"). That is to say, abc|xyz means "abc or xyz". In the above regex, we have several cases like this to cover various possibilities of values. For example, for the day, we have 3 possibilities:

  • 0[1-9]: a zero, followed by "a digit from 1 to 9". Brackets define a character class, and the hyphen defines an interval. Therefore, [1-9] sets a character that can be any digit from 1 to 9. This ensures that the day can be 01 to 09
  • [12]\d: [12] is also a character class, but without the hyphen (and therefore without an interval). In this case, it means "the digit 1 or 2". So all this chunk means "digit 1 or 2, followed by any digit". This ensures that the day can be 10 to 29
  • 3[01]: Digit 3 followed by 0 or 1 (for days 30 and 31)

The | between these 3 expressions lets the regex match any of these possibilities, and the parentheses around them group the whole sub-expression into a single unit.

Something similar was done for the month, where 0[1-9]|1[0-2] means:

  • 0[1-9]: zero followed by a digit from 1 to 9 (for the months 01 to 09), or
  • 1[0-2]: digit 1 followed by a digit from 0 to 2 (for months 10, 11 and 12)

And for the year, I used (19|20)\d{2}, which means "19 or 20, followed by 2 digits". That is, any year between 1900 and 2099. This is just one example; you can use \d{4} if you prefer (bearing in mind that this accepts values between 0000 and 9999).
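To see what this stricter regex accepts and rejects, here is a quick check with fullmatch, which tests a whole string against the pattern. The candidate strings are made up just to exercise each part of the regex:

```python
import re

r = re.compile(r'(0[1-9]|[12]\d|3[01])\.(0[1-9]|1[0-2])\.(19|20)\d{2}')

# Hypothetical candidates: the first two are in range, the others break
# the day, the month or the year restriction.
candidatos = ["01.02.2019", "31.12.2099", "00.01.2019", "32.01.2019",
              "15.13.2019", "15.03.1224"]

for c in candidatos:
    # fullmatch returns a match object only if the entire string matches
    print(c, "accepted" if r.fullmatch(c) else "rejected")
```

Only the first two candidates are accepted.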

There is just one detail: the parentheses form a capturing group, and when a regex has capturing groups, findall returns a list of tuples with the groups. Using the same text from the previous example:

r = re.compile(r'(0[1-9]|[12]\d|3[01])\.(0[1-9]|1[0-2])\.(19|20)\d{2}')
print(r.findall(texto))

The output is:

[('01', '02', '20'), ('20', '11', '20'), ('15', '03', '19')]

That is, a list of tuples, and each tuple has a separate day, month and year. In fact, the third element is only the first two digits of the year, as it is the part that is in parentheses in the regex.

To fix this, just change the parentheses to non-capturing groups, by putting ?: right after the (. The regex then looks like this:

r = re.compile(r'(?:0[1-9]|[12]\d|3[01])\.(?:0[1-9]|1[0-2])\.(?:19|20)\d{2}')
print(r.findall(texto))

Now the result is correct:

['01.02.2019', '20.11.2018', '15.03.1980']
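If, on the other hand, you do want the day, month and year as separate pieces, one possible variation (a sketch, not the regex used in the rest of this answer) is to keep capturing groups around each part and make only the inner 19|20 alternation non-capturing, so that the year is captured whole:

```python
import re

texto = """Data 01.02.2019, outra data 20.11.2018 etc...
Outra data 15.03.1980, etc
"""

# One capturing group each for day, month and year; the inner (?:19|20)
# is non-capturing, so the third group holds all four digits of the year.
r = re.compile(r'(0[1-9]|[12]\d|3[01])\.(0[1-9]|1[0-2])\.((?:19|20)\d{2})')

for dia, mes, ano in r.findall(texto):
    print(dia, mes, ano)
```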

But this regex still accepts some invalid dates, such as 31.04.2019 (invalid because April has only 30 days) or 29.02.2019 (invalid because 2019 is not a leap year, so February 2019 has only 28 days). I don’t know what your text looks like or how it is generated, but there may be typos, for example, and in those cases it is worth validating the dates.

While it is possible to write a regex that validates all of this (including leap years), it is so complicated that it is not worth it. Here is an example; try to understand it (it may be interesting as an exercise, but I would never use it in production):

^(?:(?:(?:0?[13578]|1[02])(\/|-|\.)31)\1|(?:(?:0?[1,3-9]|1[0-2])(\/|-|\.)(?:29|30)\2))(?:(?:1[6-9]|[2-9]\d)?\d{2})$|^(?:0?2(\/|-|\.)29\3(?:(?:(?:1[6-9]|[2-9]\d)?(?:0[48]|[2468][048]|[13579][26])|(?:(?:16|[2468][048]|[3579][26])00))))$|^(?:(?:0?[1-9])|(?:1[0-2]))(\/|-|\.)(?:0?[1-9]|1\d|2[0-8])\4(?:(?:1[6-9]|[2-9]\d)?\d{2})$

Just understanding this regex already takes a while; imagine maintaining it...


If you want to validate dates, do it outside the regex

As we have seen, a simpler regex can match several strings that are not dates (such as 00.99.1224 or 31.04.2019). A more precise regex (like the one shown above), on the other hand, is so complicated that - in my opinion - it stops being worth using, because it is a maintenance nightmare.

Perhaps the best option is a compromise: use a regex that is not so complicated and not so precise, but that finds something that looks like a date, and then validate that result to make sure it is in fact a valid date. We can use the regex already seen, which checks days between 1 and 31 and months between 1 and 12, and which already serves as a good initial filter.

To validate the more complicated cases (leap years, whether the month has 30 or 31 days, etc.), we can use the datetime module, whose datetime class has the method strptime, which turns a string into a date (you just specify the format the string is in) and raises a ValueError if the date is invalid. So first we can create a function that checks whether a string represents a valid date in the "day.month.year" format:

from datetime import datetime

def data_valida(data):
    try:
        datetime.strptime(data, "%d.%m.%Y")
        return True
    except ValueError:
        return False
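As a quick sanity check of this function, the sketch below repeats it and calls it with a few sample strings chosen for illustration (a 31-day month, a 30-day month, and February in a common and in a leap year):

```python
from datetime import datetime

def data_valida(data):
    try:
        datetime.strptime(data, "%d.%m.%Y")
        return True
    except ValueError:
        return False

# March has 31 days, April has 30; 2019 is not a leap year, 2020 is.
for data in ["31.03.2019", "31.04.2019", "29.02.2019", "29.02.2020"]:
    print(data, data_valida(data))
```

The first and last strings are reported as valid, the two in the middle as invalid.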

Then we can use this function with the result of findall, and discard results that are not valid:

import re

texto = """Data 01.02.2019, outra data 20.11.2018 etc...
Outra data 15.03.1980, etc
Data inválida: 31.04.2019
"""

r = re.compile(r'(?:0[1-9]|[12]\d|3[01])\.(?:0[1-9]|1[0-2])\.(?:19|20)\d{2}')

datas = [data for data in r.findall(texto) if data_valida(data)]
print(datas)

With this, the regex still finds invalid dates (such as 31.04.2019), but the function performs an extra check. In the end, the list contains only valid dates:

['01.02.2019', '20.11.2018', '15.03.1980']
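Since strptime already builds a date while validating, a possible variation is to keep the parsed objects instead of the strings, which makes later sorting or comparison easier. This is only a sketch: the helper name extrair_datas and the shortened text are made up for the example.

```python
import re
from datetime import datetime

r = re.compile(r'(?:0[1-9]|[12]\d|3[01])\.(?:0[1-9]|1[0-2])\.(?:19|20)\d{2}')

def extrair_datas(texto):
    """Hypothetical helper: return the valid dates in texto as date objects."""
    datas = []
    for candidato in r.findall(texto):
        try:
            datas.append(datetime.strptime(candidato, "%d.%m.%Y").date())
        except ValueError:
            pass  # looked like a date, but is not a real one (e.g. 31.04.2019)
    return datas

texto = "Data 01.02.2019, outra data 20.11.2018, data inválida: 31.04.2019"
for d in extrair_datas(texto):
    print(d.isoformat())
```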

Note that to create the list of dates I used a list comprehension, which is much more succinct and pythonic. But if you prefer, you can use a more "traditional" loop, of the kind common to other languages:

# The loop below is equivalent to:
# datas = [data for data in r.findall(texto) if data_valida(data)]

datas = []
for data in r.findall(texto):
    if data_valida(data):
        datas.append(data)
print(datas)

There is still room for improvement, of course. If your text has something like 112.12.2019 (it may be a typo with an extra 1 at the start, but it could also be, say, some specific code that coincidentally "looks" like a date), the regex will ignore the first 1 and match the rest (12.12.2019).

If you want to ignore cases like this, we can restrict the match to dates that are "isolated" in the text, that is, with no other alphanumeric character immediately before or after. We can do this with the shortcut \b (called a word boundary, something like "boundary between words"):

r = re.compile(r'\b(?:0[1-9]|[12]\d|3[01])\.(?:0[1-9]|1[0-2])\.(?:19|20)\d{2}\b')

With that, cases like 112.12.2019 are ignored by the regex.
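A side-by-side comparison of the two versions, on a made-up text that has an extra digit before one date and after another:

```python
import re

padrao = r'(?:0[1-9]|[12]\d|3[01])\.(?:0[1-9]|1[0-2])\.(?:19|20)\d{2}'
sem_borda = re.compile(padrao)                  # no word boundaries
com_borda = re.compile(r'\b' + padrao + r'\b')  # with word boundaries

# Illustrative text: only the middle date is "isolated"
texto = "codigo 112.12.2019, data 12.12.2019, outro 12.12.20199"

print(sem_borda.findall(texto))  # also picks up the two "glued" cases
print(com_borda.findall(texto))  # only the isolated date
```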

I only made all these suggestions because it was not clear what might be in your text. Depending on how varied your data is, you can adjust the complexity of the regex. The important thing is to make sure it matches what you need and - no less important - that it does not match what you don’t need.

But since the "not matching what you don’t need" part is more complicated, I still think it is worth validating the dates outside the regex, to make sure they really are dates. But it’s up to you whether to do it or not.


(1) In Python 3, the digits accepted by \d are any characters of the Unicode category "Number, Decimal Digit", which includes characters such as ٠١٢٣٤٥٦٧٨٩, among others.

If you want the regex to accept only the digits 0 to 9 (ignoring other characters such as ٠١٢٣٤٥٦٧٨٩), you can replace \d with [0-9], or use the ASCII flag:

# so that \d does not match other digit characters, use the ASCII flag
r = re.compile(r'\b(?:0[1-9]|[12]\d|3[01])\.(?:0[1-9]|1[0-2])\.(?:19|20)\d{2}\b',
               flags = re.ASCII)

# or replace \d with [0-9]
r = re.compile(r'\b(?:0[1-9]|[12][0-9]|3[01])\.(?:0[1-9]|1[0-2])\.(?:19|20)[0-9]{2}\b')

Of course, if you are working with texts in Portuguese, this situation is less likely to occur, but the tip stands anyway.
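For the curious, a small sketch of that difference. The date is written with Arabic-Indic digits, using \u escapes so that the source stays easy to read; the variable names are only for illustration:

```python
import re

# "01.02.2019" written with Arabic-Indic digits (U+0660 to U+0669)
arabe = "\u0660\u0661.\u0660\u0662.\u0662\u0660\u0661\u0669"
comum = "01.02.2019"

padrao = r'\d{2}\.\d{2}\.\d{4}'

# Default (Unicode) behaviour for str patterns: \d accepts these digits
print(bool(re.fullmatch(padrao, arabe)))
# With the ASCII flag, \d is restricted to 0-9
print(bool(re.fullmatch(padrao, arabe, flags=re.ASCII)))
# [0-9] is always restricted to 0-9
print(bool(re.fullmatch(r'[0-9]{2}\.[0-9]{2}\.[0-9]{4}', arabe)))
# Ordinary digits still match with the ASCII flag
print(bool(re.fullmatch(padrao, comum, flags=re.ASCII)))
```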

  • 1

    Usually I stick to good practices and use only the vote as a way to gratify a good response. But not this time. God bless that answer. Great! Thank you so much for this.

  • @Caiodepaulasilva Thank you for your comment! At first I was just going to post the regex and some examples, but I thought it was worth going a little deeper into the subject (not least because dates are more complicated than they seem). Good to know that more people recognize the importance of not just sticking to the basics and of delving into the details :-)

  • 1

    Perfect. Many thanks <3

3

TL;DR

Using regex:

import re
texto = '''
Data de fabricação: 20.02.2019
Validade: 30.12.2099
'''

print(re.findall('\d{2}\.\d{2}\.\d{4}',texto))

Output:

['20.02.2019', '30.12.2099']

See it working on repl.it.

  • 2

    Don’t forget to prefix the regex string with r. In this case it happens to work, because the \ sequences used in the regex are not ones that Python replaces.
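    A case where the missing r does change the result is \b: in a normal string Python turns it into the backspace character before the regex engine ever sees it. A minimal sketch, with a made-up text:

```python
import re

texto = "Validade: 30.12.2019"

normal = '\b2019\b'   # Python replaces each \b with a backspace character
raw = r'\b2019\b'     # the regex engine receives \b, a word boundary

print(len(normal), len(raw))      # the two strings do not even have the same length
print(re.findall(normal, texto))  # looks for backspace + 2019 + backspace
print(re.findall(raw, texto))     # looks for 2019 as an isolated "word"
```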
