Regular expression is correct, but the script does not enter the if

I wonder why my code isn't entering the innermost if.

I tested the regexes on some websites and they work. I tested with this value: nucepe-2018-pc-pi-perito-criminal-informatica-prova.pdf.

#!/bin/bash

# Declaring the regexes that will be used
regex_prova='(prova)'
regex_banca='((?:[a-z][a-z0-9_]*))'
regex_ano='\b(19|20)\d{2}\b'
regex_orgao='.*?(?:[a-z][a-z0-9_]*).*?((?:[a-z][a-z0-9_]*))'
regex_cargo=

# Looping over all PDF files in the folder
for file in *.pdf

do 
    # Get the PDF file name
    str=$file

    # Check whether the word "prova" appears in the file name
    if [[ $str =~ $regex_prova ]]; then

        # Get the banca (examining board)
        if [[ $str =~ $regex_banca ]]; then

            echo $str  # Does not get here
            #echo ${BASH_REMATCH[1]}  # Does not get here
        else
            echo 'Erro na Regex_banca'
        fi

        # Get the year of the exam
        if [[ $str =~ $regex_ano ]]; then

            echo ${BASH_REMATCH[1]}
        else
            echo 'Erro no Regex_Ano'
        fi

    else
        echo 'Erro na Regex_prova'
    fi


done

    It would be good to edit the question and explain how you want to capture the data. You said you tested it and it worked; for me, both the year expression and the banca expression failed. I can already say that the expression for capturing the year is incorrect. You can write it like this: regex_ano='([[:digit:]]{4})' or like this: regex_ano='([0-9]{4})'

1 answer


The problem with regex is that there are several different flavors. I would say that if regex were a language, it would have several dialects (in fact, some documentation does call these variations "dialects").

Regex syntax offers many features, but each language and engine implements only a subset of them. There are even differences in syntax, so what works in one environment may not work in another. That's why the expressions in your script may work on some websites but not in Bash.

The websites you tested are probably using engines compatible with PCRE (Perl Compatible Regular Expressions) or something similar, while Bash uses GNU ERE (Extended Regular Expressions).

On this page, if you choose "GNU ERE" and "PCRE" in the two drop-downs at the top, you can see that the shorthand \d is only implemented by the latter:

[Image: table showing that \d is not supported by the GNU ERE engine]

\d is a shorthand for "digits". In GNU ERE we can use an equivalent expression, [0-9]: the brackets define a character class, that is, it accepts any one of the characters listed inside them. In this case, [0-9] means "any digit from 0 to 9".
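A quick way to see the difference is to run both forms in Bash itself. This is only an illustrative sketch: the variable names re_pcre, re_bracket, and re_posix are made up for the example, and whether \d is rejected depends on the regex library Bash is linked against, so no result is assumed for that case.

```shell
#!/bin/bash
# Illustrative check; the variable names below are made up for this example.
str="nucepe-2018-pc-pi-perito-criminal-informatica-prova.pdf"

re_pcre='(\d{4})'            # PCRE-style shorthand; support depends on the system regex library
re_bracket='([0-9]{4})'      # plain bracket expression
re_posix='([[:digit:]]{4})'  # POSIX character class

# Try the PCRE-style shorthand and report what this system does with it.
if [[ $str =~ $re_pcre ]]; then
    printf '%s\n' "\\d matched: ${BASH_REMATCH[1]}"
else
    printf '%s\n' '\d did not match'
fi

# The two portable forms should both capture the four digits.
year_bracket=""
if [[ $str =~ $re_bracket ]]; then
    year_bracket=${BASH_REMATCH[1]}
fi

year_posix=""
if [[ $str =~ $re_posix ]]; then
    year_posix=${BASH_REMATCH[1]}
fi

echo "[0-9]:       $year_bracket"
echo "[[:digit:]]: $year_posix"
```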

On this other page (again choosing "GNU ERE" and "PCRE" in the drop-downs at the top), you can see that GNU ERE does not support the (?: syntax, which is a non-capturing group.

Since you are using the variable BASH_REMATCH, it makes no sense to use non-capturing groups, because the two are opposites. BASH_REMATCH stores what the parentheses capture, but only if they form a capturing group (and for that they cannot start with ?:). In any case, Bash doesn't recognize this syntax, so we can remove it.

And even if it did, writing ((?:alguma_coisa)) would not make sense, because it is a non-capturing group (something you do not want in BASH_REMATCH) inside a capturing group (something you do want to capture).
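To make the role of capturing groups concrete, here is a small sketch of what Bash stores in BASH_REMATCH after a successful match. The regex and the variable names whole, first, and second are invented for this illustration and are not part of the original script.

```shell
#!/bin/bash
# Illustrative only: shows what Bash stores in BASH_REMATCH after a match.
str="nucepe-2018-pc-pi-perito-criminal-informatica-prova.pdf"
regex='([a-z]+)-([0-9]{4})'   # two capturing groups (made up for this example)

if [[ $str =~ $regex ]]; then
    whole=${BASH_REMATCH[0]}   # text matched by the entire regex
    first=${BASH_REMATCH[1]}   # text captured by the first pair of parentheses
    second=${BASH_REMATCH[2]}  # text captured by the second pair of parentheses
    echo "whole:  $whole"
    echo "first:  $first"
    echo "second: $second"
fi
```

Index 0 always holds the full match, so the capturing groups start at index 1.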

As for \b, it is a shorthand for word boundary: a position where a word character sits next to a non-word character (or next to the start or end of the string). Exactly which characters count as part of a word varies depending on the dialect/engine/language used.

To avoid depending on these variations, I switched to [^0-9]: the ^ inside the brackets negates everything inside them. That is, [^0-9] means "any character other than a digit from 0 to 9". I believe this is enough to separate the year from the other characters.

In short, remove the non-capturing groups ((?:) and replace \d with [0-9] and \b with [^0-9]. The expressions then look like this:
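One difference from \b is worth keeping in mind: [^0-9] has to match an actual character, so a year at the very start or end of the name has nothing next to it to match. The sketch below illustrates this with a hypothetical file name that is not from the question, plus a variation (regex_ano_edges, my own assumption rather than part of this answer) that also accepts the start or end of the string.

```shell
#!/bin/bash
# Hypothetical file name in which the year comes first (not from the question).
str="2018-nucepe-prova.pdf"

regex_ano='[^0-9]((19|20)[0-9]{2})[^0-9]'
# Variation (an assumption, not part of the answer): also accept start/end of string.
regex_ano_edges='(^|[^0-9])((19|20)[0-9]{2})($|[^0-9])'

if [[ $str =~ $regex_ano ]]; then
    strict_result=${BASH_REMATCH[1]}
else
    strict_result="no match"
fi

if [[ $str =~ $regex_ano_edges ]]; then
    # Group 1 is now the left delimiter, so the year moves to group 2.
    edges_result=${BASH_REMATCH[2]}
else
    edges_result="no match"
fi

echo "regex_ano:       $strict_result"
echo "regex_ano_edges: $edges_result"
```

For the file name in the question the year sits between hyphens, so the simpler expression is enough there.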

regex_prova='prova'
regex_banca='([a-z][a-z0-9_]*)'
regex_ano='[^0-9]((19|20)[0-9]{2})[^0-9]'

I removed the parentheses around prova because you are not capturing this value, just testing for it, so the parentheses are not needed.

The code is below. Note that I removed the for loop (to make testing easier) and that I print the match obtained using BASH_REMATCH (this variable is an array, and index 1 corresponds to the first pair of parentheses in the regex):

str="nucepe-2018-pc-pi-perito-criminal-informatica-prova.pdf"

# Check whether the word "prova" appears in the file name
if [[ $str =~ $regex_prova ]]; then

    # Get the banca (examining board)
    if [[ $str =~ $regex_banca ]]; then
        echo "Banca: ${BASH_REMATCH[1]}"
    else
        echo 'Erro na Regex_banca'
    fi

    # Get the year of the exam
    if [[ $str =~ $regex_ano ]]; then
        echo "Ano: ${BASH_REMATCH[1]}"
    else
        echo 'Erro no Regex_Ano'
    fi

else
    echo 'Erro na Regex_prova'
fi

The output is:

Banca: nucepe
Ano: 2018

If the year is always between hyphens, you can also change the regex to:

regex_ano='-((19|20)[0-9]{2})-'

PS: regex_orgao was not used in your script, but it also has some problems. It uses .*?: the question mark after the asterisk makes it a lazy quantifier. Basically, * means "zero or more occurrences", but by default it is "greedy": it tries to grab as many characters as possible. With *? it becomes "lazy", grabbing as few characters as possible while still satisfying the expression.

Unfortunately, if we look at this page (again choosing "GNU ERE" and "PCRE" in the drop-downs at the top), we see that GNU ERE does not support this syntax.

Testing your regex against the file name you used, the value obtained for the órgão (agency) was pc. In this case, the expression can be simplified to:
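Since the lazy form is not available, a common workaround is a negated bracket expression that simply cannot run past the delimiter. The sketch below compares the two behaviours on a shortened, made-up string; the names re_greedy and re_negated are invented for the example.

```shell
#!/bin/bash
# Shortened example string (made up for this illustration).
str="nucepe-2018-pc"

re_greedy='(.*)-'      # greedy: grabs as much as possible before a hyphen
re_negated='([^-]*)-'  # cannot cross a hyphen, so it stops at the first one

greedy=""
negated=""
if [[ $str =~ $re_greedy ]]; then
    greedy=${BASH_REMATCH[1]}
fi
if [[ $str =~ $re_negated ]]; then
    negated=${BASH_REMATCH[1]}
fi

echo "greedy:  $greedy"
echo "negated: $negated"
```

The negated class gives the result that (.*?)- would give in PCRE for this input, without relying on lazy quantifiers.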

regex_orgao='([^\-]+-)([a-z][a-z0-9_]*)'

Note that it has two sets of parentheses, which means that BASH_REMATCH will have two groups (if the string matches the regex). The first group contains [^\-]+-, which means:

  • [^\-]+: one or more occurrences (+) of anything other than a hyphen ([^\-]). Since the hyphen has a special meaning inside brackets (it defines a range, as in [0-9]), I wrote it with a \ in front. Strictly speaking, POSIX bracket expressions treat the backslash as a literal character rather than an escape, so [^\-] excludes both \ and -; a hyphen placed last inside the brackets is already literal
  • -: a hyphen

That is, one or more characters other than a hyphen, followed by a hyphen. This expression assumes that the hyphen is the separator between the "fields" that make up the file name.
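In POSIX bracket expressions a hyphen placed last (or first) inside the brackets is taken literally, so the same expression can also be written without the backslash. The sketch below tries that variant on the file name from the question; regex_orgao_plain, prefix, and orgao are names made up for the example.

```shell
#!/bin/bash
str="nucepe-2018-pc-pi-perito-criminal-informatica-prova.pdf"

# Same idea as regex_orgao, with the hyphen written last in the brackets and no backslash.
regex_orgao_plain='([^-]+-)([a-z][a-z0-9_]*)'

prefix=""
orgao=""
if [[ $str =~ $regex_orgao_plain ]]; then
    prefix=${BASH_REMATCH[1]}  # non-hyphen characters plus the hyphen that follows them
    orgao=${BASH_REMATCH[2]}   # the field that starts with a letter
fi

echo "prefix: $prefix"
echo "orgao:  $orgao"
```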

The second pair of parentheses is the one you were already using:

  • [a-z]: a letter from a to z
  • [a-z0-9_]*: zero or more occurrences (*) of a letter from a to z, a digit from 0 to 9, or the character _

That is, regex_orgao matches characters followed by a hyphen, followed by "a letter plus letters, digits, or _". So it does not match the stretch nucepe-2018, because only digits come after that hyphen (that is, they do not match [a-z]).

The stretch it ends up matching is 2018-pc: 2018- is captured in the first group (it corresponds to the first pair of parentheses), and pc is captured in the second group.

To get the values of these groups, use ${BASH_REMATCH[1]} and ${BASH_REMATCH[2]}. But since only the second group seems to matter, we can ignore the first. That part of the script would look like this:

if [[ $str =~ $regex_orgao ]]; then
    echo "orgao: ${BASH_REMATCH[2]}"
fi

And in this case, the órgão obtained is pc.
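Putting the pieces together, the corrected expressions could be combined along the following lines. This is only a sketch: parse_name is a hypothetical helper that does not exist in the original script, and the loop runs over a literal file name so the example is self-contained (the real script would loop over *.pdf).

```shell
#!/bin/bash
# Sketch only: parse_name is a hypothetical helper, not part of the original script.
regex_prova='prova'
regex_banca='([a-z][a-z0-9_]*)'
regex_ano='-((19|20)[0-9]{2})-'
regex_orgao='([^\-]+-)([a-z][a-z0-9_]*)'

# Fills the global variables banca, ano and orgao from a file name.
# Returns 1 when the name does not contain the word "prova".
parse_name() {
    local str=$1
    banca="" ano="" orgao=""
    [[ $str =~ $regex_prova ]] || return 1
    if [[ $str =~ $regex_banca ]]; then banca=${BASH_REMATCH[1]}; fi
    if [[ $str =~ $regex_ano ]];   then ano=${BASH_REMATCH[1]};   fi
    if [[ $str =~ $regex_orgao ]]; then orgao=${BASH_REMATCH[2]}; fi
    return 0
}

# In the real script this would be: for file in *.pdf
for file in "nucepe-2018-pc-pi-perito-criminal-informatica-prova.pdf"; do
    if parse_name "$file"; then
        echo "Banca: $banca"
        echo "Ano: $ano"
        echo "Orgao: $orgao"
    else
        echo 'Erro na Regex_prova'
    fi
done
```

Keeping the three matches in one function means each result is copied out of BASH_REMATCH before the next match overwrites it.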
