Data munging with Python

Question

Asked 5 years, 4 months ago

Given the following date formats:

042555

04/25/1955

Abril 25, 1955

How can I use regex to convert each of these formats into the others (6 conversions in total)?

For example:

Input: 042555

Output: 04/25/1955 and Abril 25, 1955

Here is what I tried:

import re
# re.sub(pattern, replacement, string)

ex1 = "04/25/1955"

pattern1 = r"\d{1,2}/\d{1,2}/\d{4}"

print(f"match: {re.findall(pattern1, ex1)}")

ex2 = "042555"

pattern2 = r"\d{1,2}\d{1,2}\d{2}"

print(f"match: {re.findall(pattern2, ex2)}")


ex3 = "April 25, 1955"
pattern3 = r"[A-Za-z]{2,}\s\d+,\s\d{4}"

print(f"match: {re.findall(pattern3, ex3)}")

Does it have to be done with regex?

– Augusto Vasques

2020/03/29 at 23:13
@Augusto Vasques: Preferably... But I would like to see other solutions as well...

– Laurinda Souza

2020/03/30 at 11:12
I have included below one solution with regex and another without (and I argue that the solution without regex is better; the answer explains why) :-)

– hkotsubo

2020/03/30 at 17:08

1 answer

by hkotsubo • **55,826** points · Answer 1 · 2020-03-30T11:17:53+00:00

If you want to work with dates, don’t use regex, use a date API.
Dates are more complex than they appear and have rules that are difficult to validate only with a regular expression.

For example, you used \d{1,2}, which accepts any value between 0 and 99, so the program will mistakenly accept invalid dates such as "00/99/0000". But months can only go from 1 to 12, and a month can have 28, 29, 30 or 31 days. February is even worse, because the number of days depends on whether or not it is a leap year. All of these checks can be done with regex, but it gets so complicated that it is not worth it (to get an idea, see some examples here and here).
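To make the point above concrete, here is a minimal sketch that compares the pattern from the question with datetime.strptime on a few strings. The helper name is_valid_date is made up for this example; only re.fullmatch and datetime.strptime come from the standard library.

```python
import re
from datetime import datetime

# pattern from the question: it only checks the shape, not the values
pattern = r"\d{1,2}/\d{1,2}/\d{4}"

def is_valid_date(text, fmt="%m/%d/%Y"):
    # hypothetical helper: True only if strptime accepts the text
    try:
        datetime.strptime(text, fmt)
        return True
    except ValueError:
        return False

for text in ["04/25/1955", "00/99/0000", "02/29/1955", "02/29/1956"]:
    regex_ok = re.fullmatch(pattern, text) is not None
    print(f"{text}: regex={regex_ok}, strptime={is_valid_date(text)}")
```

The regex accepts all four strings, while strptime rejects the month 00 and February 29th of 1955 (not a leap year).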

And for the month name you used [A-Za-z]{2,}, which accepts any combination of 2 or more letters (that is, not only valid month names but anything at all, like "abc" or "Xyz"). In this case it is relatively easy to write a regex that accepts only valid month names, but you would still need to handle the other rules already mentioned (maximum number of days, leap years, etc.).
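As a rough sketch of that idea, the month names can be joined into an alternation. The names MONTHS, loose and strict are assumptions made for this example, and the month list is hard-coded in English so that the result does not depend on the system locale.

```python
import re

# English month names, hard-coded so the result does not depend on the locale
MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]

# pattern from the question: any run of 2+ letters counts as a "month"
loose = re.compile(r"[A-Za-z]{2,}\s\d+,\s\d{4}")

# stricter variant: only real month names are accepted
month_alt = "|".join(MONTHS)
strict = re.compile(r"(?:" + month_alt + r")\s\d+,\s\d{4}")

for text in ["April 25, 1955", "Xyz 25, 1955", "April 99, 1955"]:
    print(text, bool(loose.fullmatch(text)), bool(strict.fullmatch(text)))
```

Note that "April 99, 1955" still passes the stricter pattern, because the day is not validated at all.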

In Python you can use the datetime module to convert the string to a date, and then convert it back to strings in other formats, using the available format codes.

In your case, a solution would be:

from datetime import datetime

texto = '04/25/1955'
formatos = [ '%m%d%y', '%m/%d/%Y', '%B %d, %Y' ]
formato_encontrado = None
for formato in formatos:
    try:
        data = datetime.strptime(texto, formato)
        formato_encontrado = formato
        break
    except ValueError:
        pass # the format does not match the string, so do nothing here

if formato_encontrado is not None: # a format was found, so convert the date to the other formats
    for formato in formatos:
        if formato != formato_encontrado:
            print(datetime.strftime(data, formato))

I create a list of all possible formats and try to convert the text to a date using strptime. If the text does not match the format, a ValueError is raised; in this case I do nothing (but you could handle it in the except block, for example by printing the error or taking whatever action you want) and move on to the next format.

If a format was found, I convert the date to the other formats using strftime. The output of the code above is:

042555
April 25, 1955

But be careful! There are still some details to get right.

For the first case, which has a 2-digit year (55), you want the year to be 1955. But according to the documentation, values between 0 and 68 are mapped to the years 2000 to 2068, and values between 69 and 99 are mapped to the years 1969 to 1999. That is, the code above converts "042555" to April 25, 2055.
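A short check of this behaviour, using only datetime.strptime with the '%m%d%y' format from the code above (the sample strings are chosen here just to hit both sides of the boundary):

```python
from datetime import datetime

# %y mapping: 00-68 -> 2000-2068, 69-99 -> 1969-1999
for text in ["042555", "010168", "010169", "123199"]:
    print(text, "->", datetime.strptime(text, "%m%d%y").year)
```

The four years printed are 2055, 2068, 1969 and 1999.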

In this case, you must set a cutoff year to decide whether the date is in the 20th or the 21st century. For example, if I want values over 50 to be in the 1900s, I can do something like:

def year_2to4_digit(two_digit_year, pivotyear = 1950):
    century = (pivotyear // 100) * 100
    if century + two_digit_year > pivotyear:
        return century + two_digit_year
    return century + 100 + two_digit_year

...
if formato == '%m%d%y':
    data = data.replace(year = year_2to4_digit(data.year % 100))

That is, values between 51 and 99 are converted to the years 1951 to 1999, and the other values (0 to 50) to the years 2000 to 2050. If you want a different cutoff, just change the pivotyear parameter of the year_2to4_digit function (code based on this answer on Stack Overflow in English, and in this link you can also find other options).
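To see how the cutoff moves, the sketch below repeats the function so that it runs on its own and calls it with a few pivots. The pivots 1930 and 1968 are picked only for illustration; 1968 happens to reproduce the default %y mapping described earlier.

```python
def year_2to4_digit(two_digit_year, pivotyear=1950):
    # same function as in the answer, repeated so the example is self-contained
    century = (pivotyear // 100) * 100
    if century + two_digit_year > pivotyear:
        return century + two_digit_year
    return century + 100 + two_digit_year

# values at and around each cutoff
for pivot in (1950, 1930, 1968):
    anos = [year_2to4_digit(y, pivot) for y in (0, 30, 31, 50, 51, 68, 69, 99)]
    print(pivot, anos)
```

A value equal to the last two digits of the pivot still lands in the later century (50 becomes 2050 with the default pivot), because the comparison is strictly greater than.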

For the month name: in your code the input text has the month in English ("April"), but you said you want the output in Portuguese. By default Python uses the English names, so you need to use the locale module to change the language:

import locale
locale.setlocale(locale.LC_ALL, 'pt_BR.utf8')

If both input and output must be in the same language, you can set the locale once at the beginning of the program. But if the input is in English and the output in Portuguese, an alternative is to set the locale only for strftime and then restore the original configuration (so as not to affect the rest of the program, for example). The complete code looks like this:

from datetime import datetime
import locale

def year_2to4_digit(two_digit_year, pivotyear = 1950):
    century = (pivotyear // 100) * 100
    if century + two_digit_year > pivotyear:
        return century + two_digit_year
    return century + 100 + two_digit_year

texto = '042555'
formatos = [ '%m%d%y', '%m/%d/%Y', '%B %d, %Y' ]
formato_encontrado = None
for formato in formatos:
    try:
        data = datetime.strptime(texto, formato)
        formato_encontrado = formato
        if formato == '%m%d%y':
            data = data.replace(year = year_2to4_digit(data.year % 100))
        break
    except ValueError:
        pass # the format does not match the string, so do nothing here

if formato_encontrado is not None: # a format was found, so convert the date to the other formats
    for formato in formatos:
        if formato != formato_encontrado:
            if formato == '%B %d, %Y': # switch to the Portuguese locale
                # setlocale with no second argument only queries the current setting
                locale_original = locale.setlocale(locale.LC_ALL)
                locale.setlocale(locale.LC_ALL, 'pt_BR.utf8')
            print(datetime.strftime(data, formato))
            if formato == '%B %d, %Y': # switch back to the original locale
                locale.setlocale(locale.LC_ALL, locale_original)
else:
    print(f'{texto} does not match any of the accepted formats')

Output:

04/25/1955
abril 25, 1955

In my case, the output in Portuguese was "abril 25, 1955", that is, with the month name in lowercase. This depends on the locale, but if you want the first letter capitalized you can do something like:

print(datetime.strftime(data, formato).capitalize())

Then the output will be "Abril 25, 1955". For the numeric formats, the capitalize method makes no difference, since digits have no uppercase version and are left unchanged.
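The set-and-restore steps used for the locale in the complete code above could also be wrapped in a context manager. This is only a sketch: temporary_locale is a name invented here, and it switches LC_TIME only, on the assumption that the month name is all that has to change. Keep in mind that setlocale affects the whole process, so this is not suitable for multi-threaded code.

```python
import locale
from contextlib import contextmanager
from datetime import datetime

@contextmanager
def temporary_locale(name, category=locale.LC_TIME):
    # hypothetical helper: switch the locale, then always restore the previous one
    previous = locale.setlocale(category)  # no second argument: query only
    try:
        locale.setlocale(category, name)   # raises locale.Error if not installed
        yield
    finally:
        locale.setlocale(category, previous)

data = datetime(1955, 4, 25)
try:
    with temporary_locale('pt_BR.utf8'):
        print(data.strftime('%B %d, %Y').capitalize())
except locale.Error:
    print('pt_BR.utf8 is not installed on this system')
```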

It is also worth remembering that for the name of the month in Portuguese, you must have the respective locale installed on your system, as explained in this answer.
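If installing the locale is not an option, one possible fallback is to keep the Portuguese month names in a list and build the string by hand. The names MESES_PT and format_pt are made up for this sketch.

```python
from datetime import datetime

# Portuguese month names, hard-coded so no system locale is needed
MESES_PT = ['Janeiro', 'Fevereiro', 'Março', 'Abril', 'Maio', 'Junho',
            'Julho', 'Agosto', 'Setembro', 'Outubro', 'Novembro', 'Dezembro']

def format_pt(data):
    # hypothetical helper: month name from the list, day zero-padded like %d
    return f'{MESES_PT[data.month - 1]} {data.day:02d}, {data.year}'

print(format_pt(datetime(1955, 4, 25)))  # Abril 25, 1955
```

The trade-off is that only the formatting side is covered this way; parsing a Portuguese month name with strptime would still need the locale.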

If you really want to use regex

A solution would be:

import re

regex_dia = r'(?:0[1-9]|[12]\d|3[01])'
regex_mes = r'(?:0[1-9]|1[0-2])'
regex_ano = r'(?:19|20)\d{2}' # accepts only years between 1900 and 2099
regex_ano2 = r'\d{2}' # 2-digit year
# month as text
regex_mes_txt = '(?:Jan|Febr)uary|Ma(?:rch|y)|April|Ju(?:ne|ly)|August|(?:Octo|(?:Sept|Nov|Dec)em)ber'
# or, if you want it in Portuguese
#regex_mes_txt = '(?:jan|fever)eiro|ma(?:rç|i)o|abril|ju[ln]ho|agosto|(?:(?:set|nov|dez)em|outu)bro'

def get_parser(*regexes, sep='', sep2=''):
    if sep2:
        exp = f'({regexes[0]}){sep}({regexes[1]}){sep2}({regexes[2]})'
    else:
        exp = sep.join(map(lambda s: f'({s})', regexes))
    r = re.compile(f'^{exp}$')
    def parse(texto):
        m = r.match(texto)
        if not m:
            return None
        return m.group(1, 2, 3)
    return parse

parsers = [
    get_parser(regex_mes, regex_dia, regex_ano2, sep=''),
    get_parser(regex_mes, regex_dia, regex_ano, sep='/'),
    get_parser(regex_mes_txt, regex_dia, regex_ano, sep=' ', sep2=', ')
]

# month names in Portuguese
meses_pt = ['Janeiro', 'Fevereiro', 'Março', 'Abril', 'Maio', 'Junho', 'Julho', 'Agosto', 'Setembro', 'Outubro', 'Novembro', 'Dezembro']
# month names in English
meses_en = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December']

formatters = [
    # :02d keeps the leading zero of the 2-digit year (e.g. 2005 -> "05")
    lambda dia, mes, ano: f'{mes}{dia}{int(ano) % 100:02d}',
    lambda dia, mes, ano: f'{mes}/{dia}/{ano}',
    # switch to meses_en if you want the month in English
    lambda dia, mes, ano: f'{meses_pt[int(mes) - 1]} {dia}, {ano}'
]

def year_2to4_digit(two_digit_year, pivotyear = 1950):
    century = (pivotyear // 100) * 100
    if century + two_digit_year > pivotyear:
        return century + two_digit_year
    return century + 100 + two_digit_year

texto = '042555'
formato_encontrado = -1
for i, parser in enumerate(parsers):
    dados = parser(texto)
    if dados is not None:
        mes, dia, ano = dados
        formato_encontrado = i
        if i == 0:
            ano = year_2to4_digit(int(ano))
        elif i == 2:
            mes = f'{meses_en.index(mes) + 1:02d}'
        break

if formato_encontrado >= 0: # a format was found, so convert the date to the other formats
    for i, formatter in enumerate(formatters):
        if i != formato_encontrado:
            print(formatter(dia, mes, ano))
else:
    print(f'{texto} does not match any of the accepted formats')

But besides being much more complicated, this solution lets several invalid dates through, such as April 31st or February 29th in non-leap years (and in the links mentioned above you can see how complicated a regex that validates these things gets).

The example above was just to show that even an incomplete and limited regex solution gets more complicated than using the date API (which, in addition to making the code simpler, also validates all the cases the regex cannot detect). To reinforce once more: if you want to work with dates, use a date API. Regex is not always the best solution.
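To make the point about invalid dates concrete, the sketch below reuses the day, month and year building blocks from the regex solution for the MM/DD/YYYY format and compares them with strptime. The sample dates are chosen here for illustration.

```python
import re
from datetime import datetime

# same building blocks as in the regex solution
regex_dia = r'(?:0[1-9]|[12]\d|3[01])'
regex_mes = r'(?:0[1-9]|1[0-2])'
regex_ano = r'(?:19|20)\d{2}'
r = re.compile(f'^({regex_mes})/({regex_dia})/({regex_ano})$')

for texto in ['04/25/1955', '04/31/1955', '02/29/1955', '02/30/2000']:
    try:
        datetime.strptime(texto, '%m/%d/%Y')
        api_ok = True
    except ValueError:
        api_ok = False
    print(f'{texto}: regex={r.match(texto) is not None}, strptime={api_ok}')
```

All four strings match the regex, but strptime accepts only the first one.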