Regex to check if string does not end with certain characters


6

I’m racking my brain on the website http://www.regexplanet.com/advanced/java/index.html

I’m trying to write a regex that validates a .txt file only if none of its lines ends with the characters JJ or M3.

For example, I have 3 .txt files with the lines below:

.txt 1

  1. 4481;77831853;4461;60;CAD;VCP;M3
  2. 4647;86940830;4847;35;FRA;VCP;M3

.txt 2

  1. 3287;69872804;3297;37;ANT;VCP;JJ
  2. 3827;72247849;3857;38;DEC;VCP;JJ

.txt 3

  1. 5634;7082850;5634;40;MAR;VCP;PZ
  2. 4362;3882867;4382;41;PAU;VCP;PZ

I need a regex that rejects .txt 1 and .txt 2 and accepts only .txt 3, because the last two characters of its lines are neither JJ nor M3.

5 answers

8

Anyone who follows me here on Sopt knows that I am not fond of features beyond those provided by regular languages.

The answer provided by @nunks.lol uses negative lookbehind, which is not regular in the mathematical sense. But it is certainly a great solution.

But I can do without lookbehind!

Expression of words that do not end with JJ

The fact that the two letters only have to be absent from the end makes the question easier. Just look at the answer to that question to see how much work it takes to exclude a substring anywhere.

For a line not to end with JJ, there are 4 alternatives:

  1. The line is blank, so it matches ^$
  2. The line contains exactly one character, therefore ^.$
  3. The last character is not J, ^.*[^J]$
  4. The penultimate character is not J, ^.*[^J].$

So the following expression matches that:

^$|^.$|^.*[^J]$|^.*[^J].$

Ugly, isn’t it? But fortunately it can be simplified:

^(.|.*([^J]|[^J].))?$

I could have simplified [^J]|[^J]. even further, but then I would lose the shape that is reused in the next expression.
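As a quick sanity check, here is a minimal sketch that compares the long and the simplified expressions with a plain `endsWith("JJ")` test over every string of length 0 to 4 drawn from a small alphabet. The class name `NotEndingWithJJ`, the helper names and the alphabet `"JA;"` are illustrative assumptions, not part of the answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class NotEndingWithJJ {
    // The two expressions from the text above.
    static final Pattern LONG_FORM = Pattern.compile("^$|^.$|^.*[^J]$|^.*[^J].$");
    static final Pattern SHORT_FORM = Pattern.compile("^(.|.*([^J]|[^J].))?$");

    // Builds every string of length 0..maxLen over the given alphabet.
    static List<String> allStrings(String alphabet, int maxLen) {
        List<String> result = new ArrayList<>();
        result.add("");
        int start = 0;
        for (int len = 1; len <= maxLen; len++) {
            int end = result.size();
            // Extend every string of the previous length by one character.
            for (int i = start; i < end; i++) {
                for (char c : alphabet.toCharArray()) {
                    result.add(result.get(i) + c);
                }
            }
            start = end;
        }
        return result;
    }

    // Counts inputs where either regex disagrees with !endsWith("JJ").
    static int countMismatches(String alphabet, int maxLen) {
        int mismatches = 0;
        for (String s : allStrings(alphabet, maxLen)) {
            boolean expected = !s.endsWith("JJ");
            boolean longForm = LONG_FORM.matcher(s).matches();
            boolean shortForm = SHORT_FORM.matcher(s).matches();
            if (longForm != expected || shortForm != expected) {
                mismatches++;
            }
        }
        return mismatches;
    }

    public static void main(String[] args) {
        System.out.println("mismatches: " + countMismatches("JA;", 4));
    }
}
```

A brute-force comparison like this only covers short strings, but every alternative in the expression looks at the last two characters at most, so short inputs already exercise all the branches.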

Expression of words that do not end with M3

For a line not to end with M3, there are 4 alternatives:

  1. The line is blank, so it matches ^$
  2. The line contains exactly one character, therefore ^.$
  3. The last character is not 3, ^.*[^3]$
  4. The penultimate character is not M, ^.*[^M].$

I could show the ugly version first and then the simplified one, but I will skip straight to the short form:

^(.|.*([^3]|[^M].))?$

Putting it all together

To put it all together, you have some special cases to consider:

  1. Can end in J if the penultimate letter is M
  2. Can end in 3 if the penultimate letter is J

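Beyond that, merging the negated character classes does the job: [^J] and [^3] become [^J3] for the last character, and [^J] and [^M] become [^JM] for the penultimate character. The two cases above are the only ones that the merged classes fail to cover, so they are added back explicitly.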

^(.|.*([^J3]|[^JM].|J3|MJ))?$
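To see the combined expression at work in Java, here is a small sketch that runs it against some of the question's lines plus the edge cases discussed above (J3, MJ, single characters, the empty line). The class name `NoLookaroundFilter` and the method `accepts` are illustrative names, and the plain `endsWith` test is only there as a reference to compare against.

```java
import java.util.regex.Pattern;

class NoLookaroundFilter {
    // Combined expression from the text above: no lookaround involved.
    static final Pattern NOT_JJ_OR_M3 =
            Pattern.compile("^(.|.*([^J3]|[^JM].|J3|MJ))?$");

    // True when the line does NOT end with JJ or M3.
    static boolean accepts(String line) {
        return NOT_JJ_OR_M3.matcher(line).matches();
    }

    public static void main(String[] args) {
        String[] samples = {
            "4481;77831853;4461;60;CAD;VCP;M3",
            "3287;69872804;3297;37;ANT;VCP;JJ",
            "5634;7082850;5634;40;MAR;VCP;PZ",
            "", "J", "3", "J3", "MJ", "JJ", "M3"
        };
        for (String s : samples) {
            // Reference answer computed without any regex.
            boolean reference = !(s.endsWith("JJ") || s.endsWith("M3"));
            System.out.println("\"" + s + "\" regex=" + accepts(s)
                    + " reference=" + reference);
        }
    }
}
```

The two columns should agree on every line. J3 and MJ are the interesting ones, since they are exactly the endings that the merged classes alone would reject.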
  • 1

    +1 for the solution without lookaround! Your answer works in any regex implementation.

  • @nunks.lol would you believe me if I told you I never learned lookahead? I learned regex from grep, sed and vim, and I have not needed this feature yet (it does not belong to the regular languages, yet it is still called regex)

  • 1

    +1 I liked the ([^J3]|[^JM].|J3|MJ) part. Tip: you can simplify .|.* => .*

  • @Guilhermelautert but does that apply? The .* is concatenated with what comes after it, whereas the . stands alone as a single character. I couldn’t see the simplification, but as soon as I get it I will add it to the answer

  • 1

    @Jeffersonquesado actually that was my mistake, it cannot be reduced. I glanced at the ? and thought it applied to ([^J3]|[^JM].|J3|MJ)

  • @Jeffersonquesado I believe you! I also learned regex with sed, and I use it daily in the Solaris shell (it’s not GNU sed, it lacks a lot of features, parentheses have to be escaped, etc.). I use lookaround to make life easier where it is supported, but I tend to forget it exists most of the time, partly because of performance concerns when the expression is invoked many times per second (by the way, I will add this note to my answer)


6

Use a negative lookbehind to ensure that the string does not end in the patterns you defined, i.e., that the end of the line ($) is not preceded by JJ or M3. The regular expression then becomes:

^.*(?<!JJ|M3)$

Detailing:

^          # start of the line
 .*        # any character, zero or more times
   (?<!    # opens the negative lookbehind
       JJ  # literal sequence "JJ"
        |  # alternation ("or")
       M3  # literal sequence "M3"
   )       # closes the negative lookbehind
$          # end of the line

Example in regex101.com: https://regex101.com/r/2loAEN/2

(I included some lines with varied endings in J and 3 to show that the expression does not reject more than it should)
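Since the question is being tested against Java, here is a minimal sketch of how the lookbehind expression could be applied there. The class name `LookbehindFilter` and the method `accepts` are illustrative assumptions. `Pattern` and `Matcher` come from `java.util.regex`, which accepts this lookbehind because both alternatives have a fixed length.

```java
import java.util.regex.Pattern;

class LookbehindFilter {
    // Compiled once and reused; same expression as above.
    static final Pattern NOT_JJ_OR_M3 = Pattern.compile("^.*(?<!JJ|M3)$");

    // True when the line does NOT end with JJ or M3.
    static boolean accepts(String line) {
        return NOT_JJ_OR_M3.matcher(line).matches();
    }

    public static void main(String[] args) {
        String[] lines = {
            "4481;77831853;4461;60;CAD;VCP;M3",
            "3287;69872804;3297;37;ANT;VCP;JJ",
            "5634;7082850;5634;40;MAR;VCP;PZ",
            "3827;72247849;3857;38;DEC;VCP;AJ",
            "3827;72247849;3857;38;DEC;VCP;J3"
        };
        for (String line : lines) {
            System.out.println(line + " = " + accepts(line));
        }
    }
}
```

Because `matches()` already requires the whole input to match, the ^ and $ anchors are technically redundant here. They are kept so that the string is identical to the expression shown above.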

Interesting explanation of lookaround in regular expressions: https://stackoverflow.com/questions/2973436/regex-lookahead-lookbehind-and-atomic-groups

A complement: one issue to take into account when using lookahead and lookbehind (the so-called lookaround) is performance. Lookaround tends to use a little more CPU than matching "traditional" regular expressions. If you intend to apply this expression massively, with many calls per second, it may be advantageous to use a more verbose approach, such as the answer given by @Jeffersonquesado.

2

Complementing the other answers, there is another alternative: to check that something is not the case, it is often easier to check that it is, and simply invert the result. So you could use this regex:

String[] strings = { "4481;77831853;4461;60;CAD;VCP;M3", "4647;86940830;4847;35;FRA;VCP;M3",
                     "3287;69872804;3297;37;ANT;VCP;JJ", "3827;72247849;3857;38;DEC;VCP;JJ",
                     "5634;7082850;5634;40;MAR;VCP;PZ", "4362;3882867;4382;41;PAU;VCP;PZ",
                     "J", "3827;72247849;3857;38;DEC;VCP;AJ", "AJ", "JJ", "M3" };
for(String s : strings) {
    System.out.println(s + " = " + (! s.matches("^.*(JJ|M3)$")));
}

The regex is ^.*(JJ|M3)$. The markers ^ and $ are the beginning and end of the string, ensuring that the string contains only what is specified in the regex.

.* is "zero or more characters": the dot is any character (except line breaks) and the quantifier * means "zero or more occurrences".

JJ|M3 is an alternation, which matches JJ or M3.

That is, the regex checks whether the string ends with "JJ" or "M3". Therefore, it checks for exactly what you do not want. So I invert the result of the matches method using !: if the regex finds a match, the string is what you don’t want, and vice versa.

The output of the code above is:

4481;77831853;4461;60;CAD;VCP;M3 = false
4647;86940830;4847;35;FRA;VCP;M3 = false
3287;69872804;3297;37;ANT;VCP;JJ = false
3827;72247849;3857;38;DEC;VCP;JJ = false
5634;7082850;5634;40;MAR;VCP;PZ = true
4362;3882867;4382;41;PAU;VCP;PZ = true
J = true
3827;72247849;3857;38;DEC;VCP;AJ = true
AJ = true
JJ = false
M3 = false

A catch in the previous code is that the matches method creates a new Pattern instance on every call, and since it is being called inside a loop, several instances are created. But since they all use the same regex, you could improve this by creating a single instance outside the loop and reusing it:

Matcher matcher = Pattern.compile("^.*(JJ|M3)$").matcher("");
for (String s : strings) {
    System.out.println(s + " = " + (!matcher.reset(s).matches()));
}

The output is the same as the previous code.


This case can also be solved without regex:

for (String s : strings) {
    System.out.println(s + " = " + !(s.endsWith("JJ") || s.endsWith("M3")));
}

The endsWith method checks whether the string ends with the given characters. In this case, just check whether it ends with "JJ" or "M3", and negate the result.
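Since the question is about accepting or rejecting a whole .txt, here is a sketch of how the same check could be applied to every line of a file. It assumes that a file is valid only when none of its lines ends with "JJ" or "M3". The class and method names are illustrative, and the file contents are inlined as lists; in a real program they could come from something like `Files.readAllLines`.

```java
import java.util.Arrays;
import java.util.List;

class FileEndingCheck {
    // Per-line check, same logic as the loop above.
    static boolean lineIsAllowed(String line) {
        return !(line.endsWith("JJ") || line.endsWith("M3"));
    }

    // Assumption: the file is valid only if every line is allowed.
    static boolean fileIsValid(List<String> lines) {
        return lines.stream().allMatch(FileEndingCheck::lineIsAllowed);
    }

    public static void main(String[] args) {
        List<String> txt1 = Arrays.asList(
                "4481;77831853;4461;60;CAD;VCP;M3",
                "4647;86940830;4847;35;FRA;VCP;M3");
        List<String> txt2 = Arrays.asList(
                "3287;69872804;3297;37;ANT;VCP;JJ",
                "3827;72247849;3857;38;DEC;VCP;JJ");
        List<String> txt3 = Arrays.asList(
                "5634;7082850;5634;40;MAR;VCP;PZ",
                "4362;3882867;4382;41;PAU;VCP;PZ");

        System.out.println(".txt 1 valid? " + fileIsValid(txt1));
        System.out.println(".txt 2 valid? " + fileIsValid(txt2));
        System.out.println(".txt 3 valid? " + fileIsValid(txt3));
    }
}
```

`allMatch` stops at the first line that fails, so a file with a forbidden ending near the top is rejected without scanning the rest.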

  • 1

    I had not even thought of a solution without regex!

1

I believe you are reading the file line by line, so using this should solve it.
The simplest regex I could come up with was (.*(?!(M3|JJ))..|^.)$.

  • .*: Accepts the leading characters of the line
  • (?!(M3|JJ)): Negative lookahead that fails if the next two characters are M3 or JJ
  • ..: Ensures that there are two characters at the end of the line, otherwise M3 and JJ would pass
  • ^.: Allows a line with only one character
  • $: End of input, to ensure that the last accepted characters are the last ones of the line
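A minimal sketch of how this lookahead expression could be tried in Java, assuming each line is checked on its own with `Matcher.find()`. The class name `LookaheadFilter` and the method `accepts` are illustrative names, not part of the answer.

```java
import java.util.regex.Pattern;

class LookaheadFilter {
    // Expression from this answer, compiled once.
    static final Pattern ALLOWED = Pattern.compile("(.*(?!(M3|JJ))..|^.)$");

    // True when the expression finds a match in the line.
    static boolean accepts(String line) {
        return ALLOWED.matcher(line).find();
    }

    public static void main(String[] args) {
        String[] lines = {
            "4481;77831853;4461;60;CAD;VCP;M3",
            "3287;69872804;3297;37;ANT;VCP;JJ",
            "5634;7082850;5634;40;MAR;VCP;PZ",
            "J", "AJ", "J3", "JJ", "M3", ""
        };
        for (String line : lines) {
            System.out.println("\"" + line + "\" = " + accepts(line));
        }
    }
}
```

One difference from the other expressions in this thread: this one requires at least one character, so an empty line is not matched.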
  • I didn’t understand your answer; I find it confusing. Could you organize the reasoning better?

  • Okay, I’ll do it differently

0

A different approach is precisely to look for the negative result:

...

String pattern = "/(JJ|M3)/";
Pattern regexp = Pattern.compile(pattern);
Matcher m = regexp.matcher(linhaLidaDoArquivo);

// The m object also has the find() method,
// which you can iterate over to check how many
// occurrences of the given pattern were found
if(pattern.matches(m)) {
    System.out.println("Your file is INVALID");
else
    System.out.println("Your file is VALID");

In other words: you know and control what must not occur. The code above searches for the (uppercase) sequences "JJ" and "M3" at any position in the string. If they occur within the line that was read, the Pattern and Matcher classes will identify the occurrence. If one is found, the if evaluates to true and the handling for the negative state can be applied. All other cases are handled as the expected, valid scenario.
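For reference, a compilable sketch of this "look for the forbidden case" idea might look like the following. It departs from the snippet above in a few labeled ways: the pattern has no surrounding slashes, a trailing $ is added so that only the end of the line counts (as the question asks), and `Matcher.find()` does the test. The class name `NegativeResultCheck` and the method `isInvalid` are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NegativeResultCheck {
    // Assumption: only JJ or M3 at the END of the line makes it invalid,
    // hence the trailing $ (not present in the snippet above).
    static final Pattern FORBIDDEN_ENDING = Pattern.compile("(JJ|M3)$");

    // True when the line ends with one of the forbidden sequences.
    static boolean isInvalid(String line) {
        Matcher m = FORBIDDEN_ENDING.matcher(line);
        return m.find();
    }

    public static void main(String[] args) {
        // Stand-in for a line read from the file.
        String linhaLidaDoArquivo = "4481;77831853;4461;60;CAD;VCP;M3";
        if (isInvalid(linhaLidaDoArquivo)) {
            System.out.println("Your file is INVALID");
        } else {
            System.out.println("Your file is VALID");
        }
    }
}
```

With `find()` the pattern does not need a leading .* because the search may start anywhere in the line, and the $ anchor keeps it from flagging JJ or M3 in the middle of a line.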

  • 1

    I’ve never seen a regex compiled /between slashes/ in Java. Not to mention that the negative lookbehind from @nunks handles what you proposed in the most elegant way.

  • The matches method of the String class takes another String as a parameter, but you are passing the Matcher (i.e., the code does not even compile). Moreover, the right way is stringQueQueroVerificar.matches(stringContendoPattern), but you try to do the opposite. Not to mention that the regex is wrong, as @Jeffersonquesado said. Despite that, the idea of checking whether the line has JJ or M3 (instead of writing a regex that checks that it doesn’t) is good, so much so that I did this in my answer :-)
