Compare dates - Text Mining

I wrote some code for text mining, but I'm having difficulty checking whether the date it finds is more than 1 year in the past. I would appreciate any help or insight that improves my understanding.

import re
from pprint import pprint
from openpyxl import load_workbook
from openpyxl import Workbook

caminho = './amostra.xlsx'
arquivo_excel = load_workbook(caminho)
planilha = arquivo_excel['Plan1']

for row in planilha.values:
    # CHECK THE TYPE (J OR F)
    if(type(row[1]) is str ):
        tamanho_j = row[52] == 'J' and len(row[83]) <= 250
        tamanho_f = row[52] == 'F' and len(row[83]) <= 100

        if(tamanho_f or tamanho_j):
            # CHECK WHETHER THERE IS A DATE (this is where my question about comparing the date comes in)
            result = re.search(r'\d{2}/\d{2}/\d{2}', row[83])
            if(result != None):
                # CHECK THE KEYWORDS
                if (row[83] in ['Visitado', 'Visitada','Fachada', 'Fachadas', 'Box', 'instalações' ,
                 'instalação','instalada', 'instalado','estoque', 'funcionários', 'funcionário','layout']):
                        print(row[83])
                else:
                    print("não há palavras")
            else:
                print("SEM DATA")      
        else:
            print("ACIMA DOS CARACTERES")




        if(type(row[2]) is str):            
            tamanho_j = len(row[84]) <= 100
            tamanho_f = len(row[84]) <= 50

2 answers

1

Use the function https://docs.python.org/2/library/datetime.html#datetime.datetime.strptime to parse the date that was found.

Then use https://docs.python.org/3/library/datetime.html#timedelta-Objects to check the difference between the two datetime values.

Remember that your year has only 2 digits, which is bad; prefer to always use the full 4-digit year (if there are still humans in the year 10,000, it will be 5 digits :)). Use result.group(0) to get the whole matched date.

It should look like this:

from datetime import datetime

...

result = re.search(r'(\d{2})/(\d{2})/(\d{2})', row[83])

if(result != None):
    ano = result.group(0)

    data_atual = datetime.strptime(ano, "%d/%m/%y")

    if (datetime.now() - data_atual).days > 365:
        print("A data tem mais de um ano")
    elif (row[83] in ['Visitado', 'Visitada','Fachada', 'Fachadas', 'Box', 'instalações' ,
     'instalação','instalada', 'instalado','estoque', 'funcionários', 'funcionário','layout']):
            print(row[83])
    else:
        print("não há palavras")

Check your files: if the year is already written with 4 digits, then your script is wrong and the last group of the regex should match 4 digits:

 r'(\d{2})/(\d{2})/(\d{4})'

If the year has four digits, replace the lowercase %y with an uppercase %Y.
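If the spreadsheet mixes both forms, one option is to let a single regex accept either and pick the format string from the number of year digits. This is only a sketch: `DATE_RE` and `parse_first_date` are hypothetical names that do not appear in the original code. Trying the 4-digit alternative first matters, because `\d{2}/\d{2}/\d{2}` alone would cut "15/03/2010" short to "15/03/20".

```python
import re
from datetime import datetime

# Hypothetical pattern: the 4-digit year is tried before the 2-digit one,
# and \b keeps the match from stopping in the middle of a longer number.
DATE_RE = re.compile(r'\b(\d{2}/\d{2}/)(\d{4}|\d{2})\b')

def parse_first_date(text):
    """Return the first dd/mm/yy or dd/mm/yyyy date in text, or None."""
    match = DATE_RE.search(text)
    if match is None:
        return None
    year_digits = match.group(2)
    # %Y expects a 4-digit year, %y a 2-digit year.
    fmt = "%d/%m/%Y" if len(year_digits) == 4 else "%d/%m/%y"
    # strptime raises ValueError for impossible dates such as 99/99/99.
    return datetime.strptime(match.group(0), fmt)

print(parse_first_date("teste 15/03/2010 teste"))
print(parse_first_date("teste 15/03/20 teste"))
```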

Remember that 365 days is not exactly one year: a year can be longer (leap years have 366 days), so the result depends on how the calendar works, which is a bit complicated to explain here.
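If the calendar year matters more than a fixed count of days, one possible approach is to add one year to the date and compare against that, instead of counting 365 days. The sketch below is an alternative, not what the code in this answer does; `one_year_after` and `is_older_than_one_year` are hypothetical helper names, and mapping 29 February to 28 February is an assumption you may want to change.

```python
from datetime import date

def one_year_after(d):
    """Return the same calendar day one year later (29 Feb becomes 28 Feb)."""
    try:
        return d.replace(year=d.year + 1)
    except ValueError:
        # Only happens for 29 Feb when the following year is not a leap year.
        return d.replace(year=d.year + 1, day=28)

def is_older_than_one_year(d, today=None):
    """True if more than one calendar year has passed since d."""
    today = today or date.today()
    return today > one_year_after(d)

# 15/03/2019 to 15/03/2020 spans 29 Feb 2020, so it is 366 days long:
# the 365-day test says "more than a year", the calendar test does not.
start, end = date(2019, 3, 15), date(2020, 3, 15)
print((end - start).days > 365)
print(is_older_than_one_year(start, today=end))
```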


A simple comparison example for testing would be:

from datetime import datetime

import re

# With a 2-digit year

data_inserida = "teste 15/03/20 teste"

result = re.search(r'(\d{2})/(\d{2})/(\d{2})', data_inserida)

fulldate = result.group(0)

data_atual = datetime.strptime(fulldate, "%d/%m/%y")

if (datetime.now() - data_atual).days > 365:
   print("mais de um ano", fulldate)
else:
   print("menos de um ano", fulldate)

# With a 4-digit year

data_inserida = "teste 15/03/2010 teste"

result = re.search(r'(\d{2})/(\d{2})/(\d{4})', data_inserida)

fulldate = result.group(0)

data_atual = datetime.strptime(fulldate, "%d/%m/%Y")

if (datetime.now() - data_atual).days > 365:
   print("mais de um ano", fulldate)
else:
   print("menos de um ano", fulldate)

Online test: https://repl.it/@inphinit/how-old-is-a-date

  • Got it, perfect. Thank you!

  • @pauloalmansa take your time and test it first. I just corrected the first code, so copy it again, test it, and let me know whether it works.

  • I took the concept and managed to implement it. What the script actually does is read an Excel file and check whether there is a date. If the registered date is more than 1 year old, the row fails the test and returns an error. If it is within 1 year, it moves on to the next tests. Thanks for the help.

  • @pauloalmansa OK, please mark the answer as accepted.

  • @fernandosavio those were leftovers from the first version of the answer that I had not removed, thanks. I've edited it.

  • @fernandosavio I was distracted and only noticed just now: group(0) already solves everything without any need to concatenate. I adjusted the answer to simplify it.

1


What’s in row[83]?

If it is exactly the date in the "dd/mm/yy" format and nothing else, then you don't even need regex; you can parse it directly:

from datetime import datetime

data = datetime.strptime('15/03/20', '%d/%m/%y')
hoje = datetime.now()
if (hoje - data).days > 365:
    # more than 1 year has passed
    print('more than 1 year has passed')

One detail is that the year has 2 digits, and the documentation says that values from 0 to 68 are mapped to the years 2000 to 2068, while values from 69 to 99 are mapped to the years 1969 to 1999. If you want to change this rule, see this answer (in the part that talks about changing the pivot year).
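A quick way to see this mapping, followed by one possible way to change it. The helper `parse_not_in_future` is a hypothetical sketch, not taken from the linked answer: it assumes the dates in the spreadsheet can never be in the future, so a parsed year that lands after today is moved back one century.

```python
from datetime import datetime

# Default %y rule: 00-68 -> 2000-2068, 69-99 -> 1969-1999.
for text in ('15/03/68', '15/03/69'):
    print(text, '->', datetime.strptime(text, '%d/%m/%y').year)

def parse_not_in_future(text, today=None):
    """Parse dd/mm/yy, assuming (hypothetically) that no date is in the future."""
    today = today or datetime.now()
    parsed = datetime.strptime(text, '%d/%m/%y')
    if parsed > today:
        # E.g. "68" parsed as 2068 becomes 1968.
        parsed = parsed.replace(year=parsed.year - 100)
    return parsed
```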


Now, if row[83] contains text with a date somewhere in the middle, then it makes more sense to use regex:

import re
from datetime import date

texto_contendo_data = 'abc 15/03/19 123 bla'
match = re.search(r'(\d{2})/(\d{2})/(\d{2})', texto_contendo_data)
if match:
    dia, mes, ano = map(int, match.groups())
    d = date(2000 + ano, mes, dia)
    if (date.today() - d).days > 365:
        # more than 1 year has passed
        print('more than 1 year has passed')

I considered that the year with 2 digits always refers to the 21st century (i.e., year "10" refers to 2010, year "90" is 2090, etc). I also used date instead of datetime, because here we only have the day, month and year, and the time doesn’t seem to matter in calculating the difference.

If the regex finds a date in the given format, I turn the matched pieces into numbers using map and int, and then use these numeric values to build the date (you don't need to build a string just to parse it again; you can create the date directly from the values found).

Note that the regex has parentheses: they form capture groups, which I can retrieve afterwards with the groups method. Since I know there are 3 groups, I can assign the values directly to the variables dia, mes and ano.

Keep in mind that the regex matches any digits, so if the text contains "99/99/99", the code will try to create that date. Of course you can use a more elaborate regex if you like (see some options here), but either way strptime (and likewise the date constructor) raises an error if the date is invalid. If you want to handle this too, put the code inside a try/except block:

try:
    # try to create the date in here
    d = date(2000 + ano, mes, dia)
except ValueError as e:
    # error while creating the date
    print(e)  # if you want to print the error message to see what happened
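Putting the pieces of this answer together, a minimal sketch could look like the function below. The name `date_is_older_than_a_year`, the `today` parameter and the choice of returning None when no valid date is found are all assumptions made for illustration; it also keeps the assumption that 2-digit years belong to the 21st century.

```python
import re
from datetime import date

DATE_RE = re.compile(r'(\d{2})/(\d{2})/(\d{2})')

def date_is_older_than_a_year(text, today=None):
    """Check the first dd/mm/yy date found in text.

    Returns True or False, or None when there is no date or it is invalid.
    """
    match = DATE_RE.search(text)
    if not match:
        return None
    dia, mes, ano = map(int, match.groups())
    try:
        # Assumes the 2-digit year always refers to 20xx.
        d = date(2000 + ano, mes, dia)
    except ValueError:
        # Digits matched the pattern but are not a real date (e.g. 99/99/99).
        return None
    today = today or date.today()
    return (today - d).days > 365

print(date_is_older_than_a_year('abc 15/03/19 123 bla'))
```

Passing `today` explicitly makes the check easy to test with fixed dates.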

  • row[83] is the column where I want to mine the text and find the date, which is why I used regex. I just wasn't able to check whether more than 1 year had passed before moving on to the next tests. But thank you!

  • date.today vs datetime.now, from the documentation: https://docs.python.org/3/library/datetime.html#datetime.datetime.now. In an example without a time zone they are pretty much the same, but I don't see why you would import two different classes, unless it depends on something else in date (I'm referring only to your first code).

  • @Guilhermenascimento Yeah, I thought that considering the time or not could make a difference (because there are cases where it does), but in this specific case it does not: the date being read has no time, so it will always be midnight. I changed the first code, thank you!

  • Yes, I did not mention the hours explicitly, but that is what I had already answered: 365 days will indeed not be a whole year, not only because no time of day is given, but simply because each year can vary by seconds, minutes and other amounts. So in this case it is just 365, based on what was provided, and there is no need to force the script to live on guesswork ;)
