Regex to validate certain date format

3

I was modifying a regex for a C++ program so that it would validate the following date input form: 29/feb/2000. Currently it only accepts 29/02/2000 or 30/03/2017.

I tried to add the other month names, but I couldn’t get it to work. How can I make it accept 30/mar/2017 or 20/dec/2018?

Here is the regex:

"^(?:(?:0[1-9]|1[0-9]|2[0-8])(?:/|.|-)(?:0[1-9]|1[0-2])|(?:(?:29|30)(?:/|.|-)(?:0[13456789]|1[0-2]))|(?:31(?:/|.|-)(?:0[13578]|1[02])))(?:/|.|-)(?:[2-9][0-9]{3}|1[6-9][0-9]{2}|159[0-9]|158[3-9])|29(?:/|.|-)(?:02|feb|Feb)(?:/|.|-)(?:(?:[2-9](?:04|08|[2468][048]|[13579][26])|1[6-9](?:(?:04|08|[2468][048]|[13579][26])00)|159(?:2|6)|158(?:4|8))|(?:16|[2468][048]|[3579][26])00)$"
  • Boost has some interesting libraries, and I suggest using them for validation: Format Date Parser. A regex would have to be considerably more complex to tell apart, for example, 29/02/2018 and 29/02/2020, where the first is invalid and the second is valid.

  • 1

    The comments made on the answers to your other question apply here too, as Jsbueno himself pointed out.

3 answers

6


Regex is definitely not the right tool to solve this problem. However, I read in the comments that you are studying regex, so just for fun:


Regex

^(?:(?:(0?[1-9]|1\d|2[0-8])([-/.])(0?[1-9]|1[0-2]|j(?:an|u[nl])|ma[ry]|a(?:pr|ug)|sep|oct|nov|dec|feb)|(29|30)([-/.])(0?[13-9]|1[0-2]|j(?:an|u[nl])|ma[ry]|a(?:pr|ug)|sep|oct|nov|dec)|(31)([-/.])(0?[13578]|1[02]|jan|ma[ry]|jul|aug|oct|dec))(?:\2|\5|\8)(0{2,3}[1-9]|0{1,2}[1-9]\d|0?[1-9]\d{2}|[1-9]\d{3})|(29)([-/.])(0?2|feb)\12(\d{1,2}(?:0[48]|[2468][048]|[13579][26])|(?:0?[48]|[13579][26]|[2468][048])00))$


Let’s look at it unrolled in Debuggex:

Debuggex

Or explained with variables:

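#include <string>

// Builds the pattern piece by piece. For readability, a single fixed
// separator ("/") is used here instead of ([-/.]) plus backreferences.
std::string regexData() {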
    std::string
        sep                = "/",

        dia1a28            = "(0?[1-9]|1\\d|2[0-8])",
        dia29              = "(29)",
        dia29ou30          = "(29|30)",
        dia31              = "(31)",

        mesFev             = "(0?2|feb)",
        mes31diasNum       = "0?[13578]|1[02]",
        mes31diasNome      = "jan|ma[ry]|jul|aug|oct|dec",
        mes31dias          = "("+mes31diasNum+"|"+mes31diasNome+")",
        mesNaoFevNum       = "0?[13-9]|1[0-2]",
        mesNaoFevNome      = "j(?:an|u[nl])|ma[ry]|a(?:pr|ug)|sep|oct|nov|dec",
        mesNaoFev          = "("+mesNaoFevNum+"|"+mesNaoFevNome+")",
        mesTudoNum         = "0?[1-9]|1[0-2]",
        mesTudoNome        = mesNaoFevNome+"|feb",
        mesTudo            = "("+mesTudoNum+"|"+mesTudoNome+")",

        diames29Fev        = dia29+sep+mesFev,
        diames1a28         = dia1a28+sep+mesTudo,
        diames29ou30naoFev = dia29ou30+sep+mesNaoFev,
        diames31           = dia31+sep+mes31dias,
        diamesNao29Feb     = "(?:"+diames1a28+"|"+diames29ou30naoFev+"|"+diames31+")",

        ano001a9999        = "(0{2,3}[1-9]|0{1,2}[1-9]\\d|0?[1-9]\\d{2}|[1-9]\\d{3})",
        anoX4nao100        = "\\d{1,2}(?:0[48]|[2468][048]|[13579][26])",
        anoX400            = "(?:0?[48]|[13579][26]|[2468][048])00",
        anoBissexto        = "("+anoX4nao100+"|"+anoX400+")",

        dataNao29Fev       = diamesNao29Feb+sep+ano001a9999,
        data29Fev          = diames29Fev+sep+anoBissexto,

        dataFinal          = "(?:"+dataNao29Fev+"|"+data29Fev+")";
    return dataFinal;
}


Using different date separators

You can use something like:

^(dia)[-/.](mês)[-/.](ano)$
dia = match[1]; mes = match[2]; ano = match[3];

But that would allow a date like 1.2/2000.

To force both separators to be the same, capture the first one in a group and, for the second one, use a backreference to match the text captured by that group:

^(dia)([-/.])(mês)\2(ano)$
dia = match[1]; mes = match[3]; ano = match[4];
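
To see the difference between the two patterns above in isolation, here is a minimal sketch. The function names anySeparator and sameSeparator are made up for this illustration, and the day/month/year parts are reduced to plain digit counts, so only the shape is checked and not whether the date exists.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: each separator is matched independently,
// so mixed separators such as "1.2/2000" are accepted.
bool anySeparator(const std::string& text) {
    static const std::regex re(R"((\d{1,2})[-/.](\d{1,2})[-/.](\d{4}))");
    return std::regex_match(text, re);
}

// Hypothetical helper: group 2 captures the first separator and \2
// requires the second separator to be the same character.
bool sameSeparator(const std::string& text) {
    static const std::regex re(R"((\d{1,2})([-/.])(\d{1,2})\2(\d{4}))");
    return std::regex_match(text, re);
}
```

With these, anySeparator("1.2/2000") is true while sameSeparator("1.2/2000") is false, and both accept "1/2/2000".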


Code

#include <iostream>
#include <regex>

int main() {
    constexpr char text[]{"29/feb/2020"};
    std::regex re(R"((?:(?:(0?[1-9]|1\d|2[0-8])([-/.])(0?[1-9]|1[0-2]|j(?:an|u[nl])|ma[ry]|a(?:pr|ug)|sep|oct|nov|dec|feb)|(29|30)([-/.])(0?[13-9]|1[0-2]|j(?:an|u[nl])|ma[ry]|a(?:pr|ug)|sep|oct|nov|dec)|(31)([-/.])(0?[13578]|1[02]|jan|ma[ry]|jul|aug|oct|dec))(?:\2|\5|\8)(0{2,3}[1-9]|0{1,2}[1-9]\d|0?[1-9]\d{2}|[1-9]\d{3})|(29)([-/.])(0?2|feb)\12(\d{1,2}(?:0[48]|[2468][048]|[13579][26])|(?:0?[48]|[13579][26]|[2468][048])00)))");
    std::cmatch match;
    bool valid = std::regex_match(text, match, re);

    if (valid) {
        std::cout << "Valid date: " << match[0] << std::endl
                  << "Day: " << match[1]  << match[4]  << match[7]  << match[11] << std::endl
                  << "Month: " << match[3]  << match[6]  << match[9]  << match[13] << std::endl
                  << "Year: " << match[10] << match[14] << std::endl;
    } else {
        std::cout << "Invalid date!!";
    }
    return 0;
}

Output

Valid date: 29/feb/2020
Day: 29
Month: feb
Year: 2020

Example in Ideone

  • Interesting, but I didn’t understand why you used "constexpr char text" instead of "const char text". I tested your regex and it is validating formats like 29-02.2000 instead of only 29/02/2000, 29-Feb-200 or 29.2.200. I tried something like [-|/|. ] but it still validates 29-02.2000.

  • I had tried something similar, and I also applied your suggestion, but it still did not work: it keeps validating the same way. This is my code: https://pastebin.com/uztcDi7J

  • @dark777 Why aren’t you using a backreference for the second separator? See the difference: I used (?:\2|\5|\8) and \12 in my pattern.

  • I tried without a backreference and it still didn’t work. Besides ([-/.]), I also tried (?:[-/.]), ([-|/|.]) and (?:[-|/|.]), and none of them worked.

  • @dark777 Maybe I wasn’t clear enough. You MUST use a backreference. This is your code, edited with the correct regex: https://wandbox.org/permlink/nlctMLXJaGUFfNdu (and check for corrections on print())

  • 1

    An annoying little thing to understand. I had been researching for a long time how to solve this problem, along with a few others in the same code. Now it is all working, thanks for the help and attention.


2

validate the following date input form 29/Feb/2000 [...]

If you want to validate only the input format try this validation regex:

\d{2}\/[a-zA-Z]{3}\/\d{4}|\d{2}\/\d{2}\/\d{4}

But if you want a validation that accepts only real month names, I suggest you do not use regex. Instead, compare the day and month entries against the allowed values. (If that is what you need, leave a comment and I can modify the answer.)


Explanation

Validates if the sequence is:

  • 2 digits
  • 1 /
  • 3 letters from a to z (upper or lower case)
  • 1 /
  • 4 digits
    Or
  • 2 digits
  • 1 /
  • 2 digits
  • 1 /
  • 4 digits

You can also see an example of this regex working here.
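
Since the question is about C++, this is roughly how that format-only check might look with std::regex. It is only a sketch: the function name hasDateShape is made up here, and the escaped slashes are dropped because / needs no escaping in a C++ pattern.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: accepts dd/mmm/yyyy or dd/mm/yyyy by shape only.
// It does not check that the day, month name or year make sense,
// so a string like "99/xyz/0000" also passes.
bool hasDateShape(const std::string& text) {
    static const std::regex re(R"(\d{2}/[a-zA-Z]{3}/\d{4}|\d{2}/\d{2}/\d{4})");
    return std::regex_match(text, re);
}
```

Because std::regex_match must consume the whole string, no ^ and $ anchors are needed here.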

  • Interesting, but since I’m studying C++ std::regex I was trying to do this with it. I have one that checks leap years and accepts from 2000/jan/30 to 2000/Dec/30 or from 2000/01/30 to 2000/Dec/30. It checks leap years correctly, but it validates yyyy/mm/dd. I was trying to do the opposite and validate dd/mm/yyyy, which I managed, but now I need to add the month name strings to it.

  • @dark777 Got it, so you want to validate the months now? Only months like jan, feb, mar, apr? You can use several "|" (ORs) nested in the regex, but it will turn into a big monster, haha.

1

While it is possible to do this through regular expressions, I do not believe it is the best way in any programming language. (I don’t know why I answered the question thinking it was also tagged Python, but most of the answer applies, except the exact code of the example.)

Month names are much easier to check and, above all, to convert into a month number (so that you end up with a real date object) if you handle them outside the regular expression.

Also, if your application is ever going to work in a language other than English: there are frameworks for turning programs into multi-language programs, and in general they depend on you placing all the strings of your program inside a function call (often with a name meant to be almost invisible, such as _()). This function then looks up the string in the desired language in the translation database. If the month names are hardcoded inside the regular expression, you would have to pass the entire regexp to the translation engine.

Of course, it would be possible to assemble a regular expression template, with the names of the months in external variables, and to join everything using string interpolation, before calling the regular expression function - this is one of the advantages of Python regular expressions being usable through normal function calls without having a special syntax.
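
In C++, which is what the question asks about, that template idea might look like the sketch below. Everything here is an assumption for illustration: the helper name buildDateRegex, the fixed dd/month/yyyy shape, and the premise that month names contain only plain letters (no escaping of regex metacharacters is done).

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: joins the month names with '|' and drops them into
// a fixed dd/<month>/yyyy template. Matching is case-insensitive.
// Assumes the names are plain letters, so nothing is escaped.
std::regex buildDateRegex(const std::vector<std::string>& monthNames) {
    std::string alternation;
    for (const std::string& name : monthNames) {
        if (!alternation.empty()) alternation += '|';
        alternation += name;
    }
    const std::string pattern =
        "(\\d{1,2})/(" + alternation + ")/(\\d{4})";
    return std::regex(pattern, std::regex::icase);
}
```

Calling buildDateRegex({"jan", "fev", "mar"}) gives a regex that matches "25/fev/2018" but not "25/feb/2018", and the list of names could just as well come from a translation lookup.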

But regular expressions are already hard enough to read and maintain in code. Regular expressions assembled at runtime would be even harder to read.

My tip, as in the first paragraph, would be to use the regular expression only to capture the groups with day, month and year, and then use a simpler mechanism, with dictionaries and ifs, to extract the "real month". Take the opportunity to validate the day of the month, the year and so on outside the regular expression as well. I’ll put an example in Python, which works as a great pseudo-code for C++, so you’ll get an idea of the approach:

So instead of:

def validate_date(text):
    if re.search(super_complicated_auto_validating_regexp, text):
        return True
    return False

It is possible to write something like:

import datetime
import re

short_months = {"jan": 1, "fev": 2, "mar": 3, "abr": 4, "mai": 5, "jun": 6,
                "jul": 7, "ago": 8, "set": 9, "out": 10, "nov": 11, "dez": 12}

def days_per_month(month, year):
    data = {1: 31, 2: 28, 3: 31, 4: 30, 5: 31, 6: 30,
            7: 31, 8: 31, 9: 30, 10: 31, 11: 30, 12: 31}
    if month == 2 and year % 4 == 0 and (year % 100 != 0 or year % 400 == 0):
        return 29
    return data[month]

def parse_date(text):
    match = re.search(r"(\d{1,2})/(.{1,3})/(\d{2,4})", text)
    if not match:
        raise ValueError("Invalid date format")
    day, month, year = match.groups()
    day = int(day)
    if month.isdigit():
        month = int(month)
    else:
        # Month given by name: .get() returns None for unknown names
        month = short_months.get(month.lower())
    if month is None or not 1 <= month <= 12:
        raise ValueError("Invalid month")
    year = int(year)
    if year < 50:  # assume 2 digit years < 50 are in XXI
        year += 2000
    elif year <= 99:
        year += 1900
    if day > days_per_month(month, year):
        raise ValueError(f"Invalid day {day} for month {month}")
    return datetime.date(year=year, month=month, day=day)

Note that roughly 20 lines of ordinary code are needed to parse and validate the date. With the regular expression approach, you are trying to compress all the logic of those 20 lines into a single 'line', which is really a mini-program in a language that is not maintenance-friendly.
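
Since the question is about C++, here is a rough sketch of the same split there: the regex only captures day, separator, month and year, and ordinary code does the validation. The names Date, isLeapYear, daysInMonth and parseDate are made up for this sketch, English month abbreviations and 4-digit years are assumed (the two-digit-year handling of the Python version is left out), and the same-separator backreference is an extra assumption borrowed from the regex answer.

```cpp
#include <algorithm>
#include <cctype>
#include <map>
#include <regex>
#include <string>

struct Date { int day, month, year; };  // hypothetical plain result type

bool isLeapYear(int year) {
    return year % 4 == 0 && (year % 100 != 0 || year % 400 == 0);
}

int daysInMonth(int month, int year) {
    static const int days[] = {31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31};
    return (month == 2 && isLeapYear(year)) ? 29 : days[month - 1];
}

// Returns true and fills `out` only when `text` is a real calendar date
// written as dd<sep>mm<sep>yyyy or dd<sep>mon<sep>yyyy with one separator.
bool parseDate(const std::string& text, Date& out) {
    static const std::regex re(
        R"((\d{1,2})([-/.])(\d{1,2}|[A-Za-z]{3})\2(\d{4}))");
    static const std::map<std::string, int> monthNames = {
        {"jan", 1}, {"feb", 2}, {"mar", 3}, {"apr", 4},
        {"may", 5}, {"jun", 6}, {"jul", 7}, {"aug", 8},
        {"sep", 9}, {"oct", 10}, {"nov", 11}, {"dec", 12}};

    std::smatch m;
    if (!std::regex_match(text, m, re)) return false;

    // Normalise the month text to lower case before the lookup.
    std::string monthText = m[3].str();
    std::transform(monthText.begin(), monthText.end(), monthText.begin(),
                   [](unsigned char c) {
                       return static_cast<char>(std::tolower(c));
                   });

    int month = 0;
    if (std::isdigit(static_cast<unsigned char>(monthText[0]))) {
        month = std::stoi(monthText);
    } else {
        const auto it = monthNames.find(monthText);
        if (it == monthNames.end()) return false;
        month = it->second;
    }

    const int day = std::stoi(m[1].str());
    const int year = std::stoi(m[4].str());
    if (month < 1 || month > 12 || year < 1) return false;
    if (day < 1 || day > daysInMonth(month, year)) return false;

    out = Date{day, month, year};
    return true;
}
```

The leap-year rule lives in one small function that is easy to read and test, instead of being spread across several alternations of the pattern.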

That being said, the most usual way to do "real" date validation and parsing, in all the crazy formats that users can type or that show up in files, is to use a specialized library. Several people have already spent hundreds of hours thinking about how to make it friendlier and more error-proof. Otherwise you would have to duplicate that work in your own code, with a good chance of getting it wrong (see how subtle it is to calculate leap years correctly: even Microsoft got it wrong in early versions of Excel, for example).

In Python, we have the excellent dateparser, which lets you simply write:

>>> import dateparser
>>> dateparser.parse("25/fev/2018", languages=["pt"])

datetime.datetime(2018, 2, 25, 0, 0)

It accepts many date formats besides the / one, including fully written-out dates in more than 20 languages, and it is not prone to errors caused by "corner cases".

In C++ I would search for date add-on modules from some framework that you might already be using to provide more functionality to the language - there must be "natural date parsers" using Qt or Boost, for example.

  • The question is labeled with [tag:c++].

  • Yes, I don’t even know why I answered as if it were Python. Most of the text still applies, though, unfortunately not the code example. I’ll keep the answer and add the caveats.

  • 1

    Answers are welcome even in languages other than the requested one. What if I run into a similar problem in Python? Or have no C++ compiler available to write code that recognizes dates? By the way, I really liked the answer.
