Regex that accepts only letters or letters and numbers

-2

I'm trying to write a regex in Java that:

  • accepts only letters;
  • accepts letters and numbers;
  • rejects strings made up only of numbers;
  • rejects punctuation and special characters.

My difficulty is that I can't come up with a regex that accepts, for example, l0v3y0oplj or 1ads967bjk. Here is my code so far:

import java.util.regex.Matcher;
import java.util.regex.Pattern;


public class Teste {

    public static void main(String[] args) {
       
       Pattern p=Pattern.compile("([0-9]*[a-zA-Z]+)|([a-zA-Z]+[0-9]*)");
       Matcher a,b,c,d,e,f,g,h,i,j;
               a=p.matcher("admin");       // I want this to be TRUE
               b=p.matcher("l0v3y0oplj");   // I want this to be TRUE
               c=p.matcher("123admin");      // I want this to be TRUE
               d=p.matcher("123ADMIN?");      // I want this to be FALSE
               e=p.matcher("123ADMIN");       // I want this to be TRUE
               f=p.matcher("am.ouy?");  // I want this to be FALSE
               g=p.matcher("12345678909776apouatgb");    // I want this to be TRUE
               h=p.matcher("1ads967bjk");  // I want this to be TRUE
               i=p.matcher("ADMIN123");  // I want this to be TRUE
               j=p.matcher("123");      // I want this to be FALSE
               
               System.out.println("A matches: "+a.matches());// result: TRUE
               System.out.println("B matches: "+b.matches());// result: FALSE
               System.out.println("C matches: "+c.matches());// result: TRUE
               System.out.println("D matches: "+d.matches());// result: FALSE
               System.out.println("E matches: "+e.matches());// result: TRUE
               System.out.println("F matches: "+f.matches());// result: FALSE
               System.out.println("G matches: "+g.matches());// result: TRUE
               System.out.println("H matches: "+h.matches());// result: FALSE
               System.out.println("I matches: "+i.matches());// result: TRUE
               System.out.println("J matches: "+j.matches());// result: FALSE
    }
   
}

2 answers

3


In my opinion, this problem is simpler to solve without regex. Either way, let's look at one solution with regex and one without, and you can draw your own conclusions.


The problem with your regex is that it only considers a few cases:

  • [0-9]*[a-zA-Z]+ means "zero or more numbers, followed by one or more letters"
  • [a-zA-Z]+[0-9]* means "one or more letters, followed by zero or more numbers"

These options are part of an alternation (the character |, meaning "or"), so the regex tries the options separately, from left to right: if the first does not match, it tries the second, and if neither matches, it fails and no match is found.

That is, if the string has letters, digits, and then more letters (or digits, letters, and then more digits), the regex does not accept it, because it fits neither option. That's why it fails for l0v3y0oplj and 1ads967bjk.

For example, with 1ads967bjk, the regex first tries the option [0-9]*[a-zA-Z]+: the part [0-9]* consumes the digit 1, then [a-zA-Z]+ consumes ads, and then nothing in the expression is left to consume the string from the 9 onward. So it tries the second option of the alternation ([a-zA-Z]+[0-9]*), but since the string starts with a digit, it fails right away at [a-zA-Z]+. So the string does not match the expression.
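
To see where each option stops, here is a minimal sketch that asks the matcher how much of the input each option can consume from the start. The class BranchProbe and the helper consumedPrefix are names made up for this illustration; lookingAt() matches from the beginning of the input without requiring the whole input to match.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the question's code.
final class BranchProbe {

    // Returns the prefix of input that regex consumes starting at position 0,
    // or null if the regex cannot match at the start at all.
    static String consumedPrefix(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.lookingAt() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "1ads967bjk";
        // First option: stops after "1ads", leaving "967bjk" unmatched.
        System.out.println(consumedPrefix("[0-9]*[a-zA-Z]+", input));
        // Second option: cannot even start, because the input begins with a digit.
        System.out.println(consumedPrefix("[a-zA-Z]+[0-9]*", input));
    }
}
```

Since matches() needs the whole string to be consumed, neither option is enough for this input.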

One way to solve it is to allow zero or more letters or digits at the beginning and at the end, and to require at least one letter in the middle. That is:

Pattern pattern = Pattern.compile("^[a-zA-Z0-9]*[a-zA-Z]+[a-zA-Z0-9]*$");

This is basically what the other answer suggests; the difference is that the other answer does not handle uppercase letters. If you prefer, you can also keep only lowercase letters in the expression and use the CASE_INSENSITIVE option, so that the regex accepts uppercase letters as well:

// Pattern.CASE_INSENSITIVE so that uppercase and lowercase are not distinguished
Pattern pattern = Pattern.compile("^[a-z0-9]*[a-z]+[a-z0-9]*$", Pattern.CASE_INSENSITIVE);

Alternatively, the range 0-9 can be replaced with the shorthand \d (remember that inside Java string literals the character \ must be written as \\):

Pattern pattern = Pattern.compile("^[a-zA-Z\\d]*[a-zA-Z]+[a-zA-Z\\d]*$");

The markers ^ and $ indicate the beginning and the end of the string, respectively, so I guarantee that the string can contain only what is in the regex. Granted, matches always checks the entire string, but I have the habit of making it explicit in the regex whether I'm checking the whole string or not.

Then we have [a-zA-Z0-9]* (the quantifier * means "zero or more", so this is zero or more letters or digits). Next comes [a-zA-Z]+ (the quantifier + means "one or more", so this is one or more letters; in other words, it ensures there is at least one letter). Finally, we have zero or more letters or digits again.

This way the string can have letters, then digits, then more letters, then more digits, and so on. Notice that I didn't even have to handle the condition "cannot have punctuation or special characters": because the regex lists only what I want (letters and digits), any other character is automatically rejected.
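
As a quick check, here is a small sketch that runs the inputs from the question through this regex. The class AlnumRegexCheck and its method names are made up for the example. It also includes the same expression without ^ and $ to show what happens if find() is used instead of matches().

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Illustrative harness; the class and method names are assumptions.
final class AlnumRegexCheck {

    // Same expression as above, compiled once and reused.
    static final Pattern ANCHORED =
            Pattern.compile("^[a-zA-Z0-9]*[a-zA-Z]+[a-zA-Z0-9]*$");

    // Same expression without ^ and $, to show what the anchors add.
    static final Pattern UNANCHORED =
            Pattern.compile("[a-zA-Z0-9]*[a-zA-Z]+[a-zA-Z0-9]*");

    // matches() requires the whole input to match.
    static boolean isValid(String s) {
        return ANCHORED.matcher(s).matches();
    }

    // find() only needs some substring to match, so it is too permissive here.
    static boolean containsValidPart(String s) {
        return UNANCHORED.matcher(s).find();
    }

    public static void main(String[] args) {
        // Inputs from the question, mapped to the result the asker wants.
        Map<String, Boolean> expected = new LinkedHashMap<>();
        expected.put("admin", true);
        expected.put("l0v3y0oplj", true);
        expected.put("123admin", true);
        expected.put("123ADMIN?", false);
        expected.put("123ADMIN", true);
        expected.put("am.ouy?", false);
        expected.put("12345678909776apouatgb", true);
        expected.put("1ads967bjk", true);
        expected.put("ADMIN123", true);
        expected.put("123", false);

        for (Map.Entry<String, Boolean> e : expected.entrySet()) {
            boolean actual = isValid(e.getKey());
            System.out.println(e.getKey() + " -> " + actual
                    + (actual == e.getValue() ? "" : "  (unexpected)"));
        }
    }
}
```

With find() and no anchors, an input like 123ADMIN? would be reported as containing a match, which is why the anchors (or matches()) matter for validation.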


Another alternative (a little more complicated) is:

Pattern pattern = Pattern.compile("^(?![0-9]+$)(?=.*[a-zA-Z])[a-zA-Z0-9]+$");

This regex uses lookaheads, which check whether something does or does not exist ahead:

  • (?![0-9]+$): this is a negative lookahead, which checks that something does not exist ahead. That something is [0-9]+$ (one or more digits, up to the end of the string). So this checks whether the string has only digits (and if it does, the regex fails)
  • (?=.*[a-zA-Z]): this is a positive lookahead, which checks that something exists ahead. That something is .*[a-zA-Z], i.e., zero or more characters (.*) followed by a letter. So this checks whether there is a letter somewhere in the string (i.e., it ensures that there is at least one letter)

The key detail about a lookahead is that it only checks what is ahead, then goes back to where it was, and the regex continues checking the rest. Since both come right after the ^, the checks are made at the beginning of the string. Once they pass, the regex proceeds and checks [a-zA-Z0-9]+$ (one or more letters or digits, up to the end of the string).
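
The "checks and goes back" behavior can be observed directly: a lookahead on its own matches without consuming any characters. The sketch below is only an illustration; LookaheadDemo and lookaheadMatchLength are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only; the class and method names are assumptions.
final class LookaheadDemo {

    static final Pattern WITH_LOOKAHEADS =
            Pattern.compile("^(?![0-9]+$)(?=.*[a-zA-Z])[a-zA-Z0-9]+$");

    static boolean isValid(String s) {
        return WITH_LOOKAHEADS.matcher(s).matches();
    }

    // Length of the text consumed by the positive lookahead alone at the
    // start of the input, or -1 if the lookahead fails there.
    static int lookaheadMatchLength(String input) {
        Matcher m = Pattern.compile("(?=.*[a-zA-Z])").matcher(input);
        return m.lookingAt() ? m.end() - m.start() : -1;
    }

    public static void main(String[] args) {
        // The lookahead succeeds on "123admin" but consumes nothing.
        System.out.println(lookaheadMatchLength("123admin"));
        // No letter ahead, so the lookahead fails.
        System.out.println(lookaheadMatchLength("123"));

        for (String s : new String[] {"l0v3y0oplj", "123", "am.ouy?", "ADMIN123"}) {
            System.out.println(s + " -> " + isValid(s));
        }
    }
}
```

Because the lookaheads consume nothing, the final [a-zA-Z0-9]+$ still has to match the entire string starting from position 0.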

Because it goes "back and forth" over the string, the regex with lookaheads is a little slower than the first one (compare here and here). Of course, for a few small strings the difference will be imperceptible.


It is also worth remembering that these solutions do not handle accented letters or other alphabets. If you want to be more comprehensive and cover those too, one option is to use Unicode properties:

Pattern pattern = Pattern.compile("^[\\p{L}\\p{N}]*\\p{L}+[\\p{L}\\p{N}]*$");

Here, \p{L} matches every letter defined by Unicode (all the categories starting with "L" in this list), and \p{N} matches every numeric character defined by Unicode (which goes beyond the digits 0 to 9; see the full list here, here and here).

That is, it is the same idea as the first regex: zero or more letters or digits at the beginning and at the end, and at least one letter in the middle. The only thing that changes is the definition of "letter" and "number".

With this regex, strings like "親41áç123Ã۹" are also accepted (the character ۹ is one of those that Unicode considers a "digit"; see its definition here).
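
A small sketch comparing the ASCII version with the Unicode version on that string. The class UnicodeAlnumCheck and its method names are made up for the example, and the string is written with Unicode escapes so that the result does not depend on the encoding of the source file.

```java
import java.util.regex.Pattern;

// Illustrative comparison; the class and method names are assumptions.
final class UnicodeAlnumCheck {

    static final Pattern ASCII_ONLY =
            Pattern.compile("^[a-zA-Z0-9]*[a-zA-Z]+[a-zA-Z0-9]*$");

    static final Pattern UNICODE =
            Pattern.compile("^[\\p{L}\\p{N}]*\\p{L}+[\\p{L}\\p{N}]*$");

    static boolean asciiValid(String s) {
        return ASCII_ONLY.matcher(s).matches();
    }

    static boolean unicodeValid(String s) {
        return UNICODE.matcher(s).matches();
    }

    public static void main(String[] args) {
        // The example string from the text: a CJK letter, accented Latin
        // letters, ASCII digits and one Extended Arabic-Indic digit (U+06F9).
        String mixed = "\u89AA41\u00E1\u00E7123\u00C3\u06F9";
        // Only Extended Arabic-Indic digits, no letters at all.
        String onlyDigits = "\u06F1\u06F2\u06F3";

        System.out.println("mixed, ASCII regex:        " + asciiValid(mixed));
        System.out.println("mixed, Unicode regex:      " + unicodeValid(mixed));
        System.out.println("onlyDigits, Unicode regex: " + unicodeValid(onlyDigits));
    }
}
```

The rule "cannot have only numbers" still holds with the Unicode version: a string made up only of non-ASCII digits is rejected because \p{L}+ finds no letter.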


Regex-free

But like I said, it might be easier to do without regex. Just loop over the characters of the string and do the checks as you go:

public boolean verifica(String s) {
    boolean temLetra = false;
    for (int i = 0; i < s.length(); i++) {
        char c = s.charAt(i);
        boolean isLetra = ('a' <= c && c <= 'z') || ('A' <= c && c <= 'Z');
        if (isLetra && !temLetra) // there is at least one letter
            temLetra = true;
        if (! (('0' <= c && c <= '9') || isLetra)) // neither digit nor letter: no need to check the rest
            return false;
    }
    return temLetra;
}

...
System.out.println(verifica("admin")); // true

In the loop I track whether any letter has appeared, and at the same time I check that each character is a letter or a digit. If I find something that is neither, I return false right away (I already know the string is invalid, so there is no point in checking the rest). This solution works for all the cases you mentioned (but, unlike the previous example with Unicode properties, it does not handle accented letters or other alphabets).
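
To confirm that claim, here is a sketch that runs the loop and the first regex side by side on the inputs from the question. The class LoopVsRegex is a made-up name, and verifica is the same logic as above, only made static and slightly restructured so it can be called from main.

```java
import java.util.regex.Pattern;

// Illustrative comparison; the class name is an assumption.
final class LoopVsRegex {

    static final Pattern REGEX =
            Pattern.compile("^[a-zA-Z0-9]*[a-zA-Z]+[a-zA-Z0-9]*$");

    // Same logic as the loop in the answer, made static for this sketch.
    static boolean verifica(String s) {
        boolean temLetra = false;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            boolean isLetra = ('a' <= c && c <= 'z') || ('A' <= c && c <= 'Z');
            if (isLetra) {
                temLetra = true;           // at least one letter seen
            } else if (!('0' <= c && c <= '9')) {
                return false;              // neither letter nor digit
            }
        }
        return temLetra;
    }

    public static void main(String[] args) {
        String[] samples = {
            "admin", "l0v3y0oplj", "123admin", "123ADMIN?", "123ADMIN",
            "am.ouy?", "12345678909776apouatgb", "1ads967bjk", "ADMIN123", "123"
        };
        for (String s : samples) {
            boolean loop = verifica(s);
            boolean regex = REGEX.matcher(s).matches();
            System.out.println(s + ": loop=" + loop + ", regex=" + regex);
        }
    }
}
```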

If you want to accept all the letters and digits defined by Unicode, you can use the methods isLetter and isDigit of the Character class:

public boolean verifica(String s) {
    boolean temLetra = false;
    for (int i = 0; i < s.length(); i++) {
        char c = s.charAt(i);
        boolean isLetra = Character.isLetter(c);
        if (isLetra && !temLetra) // there is at least one letter
            temLetra = true;
        if (!(Character.isDigit(c) || isLetra)) // neither digit nor letter: no need to check the rest
            return false;
    }
    return temLetra;
}
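
One detail that, as far as I can tell, the char-based loop does not cover: letters outside the Basic Multilingual Plane are stored as two char values (a surrogate pair), and neither half counts as a letter on its own. If that matters for your data, a variant that walks code points is one possible adjustment. This is only a sketch, and the class CodePointCheck and the method names are assumptions.

```java
// Illustrative variant; the class and method names are assumptions.
final class CodePointCheck {

    // char-based version, same logic as the answer's loop.
    static boolean verificaChars(String s) {
        boolean temLetra = false;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            boolean isLetra = Character.isLetter(c);
            if (isLetra) {
                temLetra = true;
            } else if (!Character.isDigit(c)) {
                return false;
            }
        }
        return temLetra;
    }

    // Code-point-based version: a surrogate pair is read as one character.
    static boolean verificaCodePoints(String s) {
        boolean temLetra = false;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (Character.isLetter(cp)) {
                temLetra = true;
            } else if (!Character.isDigit(cp)) {
                return false;
            }
            i += Character.charCount(cp); // advance by 1 or 2 chars
        }
        return temLetra;
    }

    public static void main(String[] args) {
        // U+20000 is a CJK ideograph outside the BMP, followed by two digits.
        String supplementary = new String(Character.toChars(0x20000)) + "42";
        System.out.println("chars:       " + verificaChars(supplementary));
        System.out.println("code points: " + verificaCodePoints(supplementary));
    }
}
```

For strings that stay within the Basic Multilingual Plane, both versions behave the same.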

At first glance, the code above may look worse because the regex solution "has fewer lines", but shorter code is not necessarily better. Regex has an overhead that often goes unnoticed (especially with small strings checked only a few times, where the difference is imperceptible), but depending on the situation it can become a performance bottleneck.

Regex is cool and I personally like it a lot, but it is not always the best solution.
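
If you do stay with regex, part of that overhead comes from compiling the expression. String.matches compiles the pattern again on every call, while a Pattern kept in a constant is compiled only once. The sketch below shows both forms. The class PrecompiledValidator and its method names are made up, and the timing loop is a crude illustration, not a rigorous benchmark (a tool like JMH would be needed for real measurements).

```java
import java.util.regex.Pattern;

// Illustrative only; the class and method names are assumptions.
final class PrecompiledValidator {

    private static final String EXPRESSION = "[a-zA-Z0-9]*[a-zA-Z]+[a-zA-Z0-9]*";

    // Compiled once when the class is loaded.
    private static final Pattern PRECOMPILED = Pattern.compile(EXPRESSION);

    static boolean isValidPrecompiled(String s) {
        return PRECOMPILED.matcher(s).matches();
    }

    // String.matches compiles EXPRESSION again on each call.
    static boolean isValidRecompiling(String s) {
        return s.matches(EXPRESSION);
    }

    public static void main(String[] args) {
        String input = "l0v3y0oplj";
        int rounds = 200_000;
        int hits = 0;

        long t0 = System.nanoTime();
        for (int i = 0; i < rounds; i++) {
            if (isValidPrecompiled(input)) hits++;
        }
        long t1 = System.nanoTime();
        for (int i = 0; i < rounds; i++) {
            if (isValidRecompiling(input)) hits++;
        }
        long t2 = System.nanoTime();

        System.out.println("precompiled: " + (t1 - t0) / 1_000_000 + " ms");
        System.out.println("recompiling: " + (t2 - t1) / 1_000_000 + " ms");
        System.out.println("hits: " + hits); // keeps the loops from being optimized away
    }
}
```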

0

I would simplify your points "a", "b", "c" and "d" to:

  1. Letters and Numbers
  2. Numbers are optional
  3. It can’t be empty

And this expression is the answer:

^[a-z0-9]*[a-z]+[a-z0-9]*$

You can test it here.
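
One caveat when using this expression in Java as written: it lists only lowercase letters, so inputs from the question such as ADMIN123 are rejected unless the pattern is compiled with Pattern.CASE_INSENSITIVE. A minimal sketch of the difference (CaseSensitivityDemo and its method names are made up for the example):

```java
import java.util.regex.Pattern;

// Illustrative only; the class and method names are assumptions.
final class CaseSensitivityDemo {

    static final Pattern LOWER_ONLY =
            Pattern.compile("^[a-z0-9]*[a-z]+[a-z0-9]*$");

    static final Pattern IGNORE_CASE =
            Pattern.compile("^[a-z0-9]*[a-z]+[a-z0-9]*$", Pattern.CASE_INSENSITIVE);

    static boolean lowerOnly(String s) {
        return LOWER_ONLY.matcher(s).matches();
    }

    static boolean ignoreCase(String s) {
        return IGNORE_CASE.matcher(s).matches();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"admin", "ADMIN123", "123ADMIN", "123"}) {
            System.out.println(s + ": lowerOnly=" + lowerOnly(s)
                    + ", ignoreCase=" + ignoreCase(s));
        }
    }
}
```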
