Regular expression for alphanumeric characters, hyphen, space or single quotes

Question

Asked 5 years, 11 months ago

Viewed 1,070 times

1

I’m looking for a regex that accepts alpha-numeric characters, spaces, ' and -.

These examples should be accepted: "Jean-da Silva", "Carlos 2", "João d'lango".

These examples should be rejected: "J@ão", "Carlos*".

Igor, I updated the answer (a break was missing from the for loop in the second code block).

– hkotsubo

2019/08/28 at 20:29

2 answers

2

A first alternative (simplistic and naive, for reasons explained further down) would be:

String[] v = { "Jean-da Silva", "Carlos 2", "João d'lango", "J@ão", "Carlos*", "teste_1" };
Matcher matcher = Pattern.compile("^[-\\w' &&[^_]]+$", Pattern.UNICODE_CHARACTER_CLASS).matcher("");
for (String s : v) {
    matcher.reset(s);
    System.out.println(s + "=" + (matcher.find() ? "válida" : "inválida"));
}

The regex uses a character class (bounded by square brackets []).

Within this character class there is a hyphen, the shorthand \w (which matches letters, digits and the character _), a single quote ' and a space (note the space between the ' and the &). Since \w also matches _, I use the intersection syntax (&&) with a negated character class ([^_]) to exclude the _ from the set.

Then I use the quantifier +, meaning "one or more occurrences". That is, the string can have one or more characters from the list (letters, digits, hyphen, ' or space). Finally, I use the anchors ^ and $, which mark the beginning and the end of the string respectively. This guarantees that the string contains only these characters.
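To illustrate why the anchors matter, here is a small sketch comparing the same character class with and without ^ and $. The class and constant names (AnchorDemo, ANCHORED, UNANCHORED) are made up for this example and are not part of the original answer.

```java
import java.util.regex.Pattern;

class AnchorDemo {
    // Same character class as in the answer, with and without the ^ and $ anchors.
    static final Pattern ANCHORED =
        Pattern.compile("^[-\\w' &&[^_]]+$", Pattern.UNICODE_CHARACTER_CLASS);
    static final Pattern UNANCHORED =
        Pattern.compile("[-\\w' &&[^_]]+", Pattern.UNICODE_CHARACTER_CLASS);

    public static void main(String[] args) {
        String s = "J@\u00e3o"; // "J@ão"
        // With anchors, the whole string must consist of allowed characters.
        System.out.println(ANCHORED.matcher(s).find());      // false
        // Without anchors, find() is satisfied by the substring "J".
        System.out.println(UNANCHORED.matcher(s).find());    // true
        // matches() always requires the entire input to match.
        System.out.println(UNANCHORED.matcher(s).matches()); // false
    }
}
```

So find() needs the anchors to validate the whole string, whereas matches() would behave the same with or without them.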

However, by default the shorthand \w does not match accented characters. So I also use the option UNICODE_CHARACTER_CLASS, so that \w matches accented letters as well.
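The effect of the flag can be checked with a minimal sketch like the one below. The helper accepts and the class UnicodeFlagDemo are hypothetical names used only for illustration; the regex is the one from the answer.

```java
import java.util.regex.Pattern;

class UnicodeFlagDemo {
    // Compiles the answer's regex with the given flags and tests one string.
    static boolean accepts(String s, int flags) {
        return Pattern.compile("^[-\\w' &&[^_]]+$", flags).matcher(s).find();
    }

    public static void main(String[] args) {
        String name = "Jo\u00e3o"; // "João"
        // Default \w only covers [a-zA-Z_0-9], so the accented letter is rejected.
        System.out.println(accepts(name, 0));                               // false
        // With the flag, \w follows the Unicode definition of a word character.
        System.out.println(accepts(name, Pattern.UNICODE_CHARACTER_CLASS)); // true
    }
}
```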

The output is:

Jean-da Silva=válida
Carlos 2=válida
João d'lango=válida
J@ão=inválida
Carlos*=inválida
teste_1=inválida

I could also have used something like [0-9a-záéíóúãõâêîôû] to cover the accents (adding inside the brackets any other characters you need, such as ç and ñ). But \w with the option UNICODE_CHARACTER_CLASS already covers all of these characters (some caveats about this solution are discussed at the end).

But as I said, this regex is naive: if the string is "---", "' -", or contains only spaces, it is also considered valid.
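A short sketch makes this weakness concrete. It reuses the regex from the first code block; the names NaiveRegexDemo and naiveAccepts are assumptions introduced for this example.

```java
import java.util.regex.Pattern;

class NaiveRegexDemo {
    // The "naive" regex from the first code block.
    static final Pattern NAIVE =
        Pattern.compile("^[-\\w' &&[^_]]+$", Pattern.UNICODE_CHARACTER_CLASS);

    static boolean naiveAccepts(String s) {
        return NAIVE.matcher(s).find();
    }

    public static void main(String[] args) {
        // Strings made only of separators still pass, because each
        // character on its own belongs to the character class.
        System.out.println(naiveAccepts("---"));  // true
        System.out.println(naiveAccepts("' -"));  // true
        System.out.println(naiveAccepts("   "));  // true
    }
}
```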

A slightly better alternative would be to split the string on the separators (space, ' or hyphen) and check that each of the parts contains only letters and digits:

String[] v = { "Jean-da Silva", "Carlos 2", "João d'lango", "J@ão", "Carlos*", "teste_1", "---", "' -" };
Matcher matcher = Pattern.compile("^[\\w&&[^_]]+$", Pattern.UNICODE_CHARACTER_CLASS).matcher("");
for (String s : v) {
    boolean valida = false;
    // split on the separators
    for (String parte : s.split("[-' ]")) {
        matcher.reset(parte);
        valida = matcher.find();
        if (!valida) // an invalid part was found, so the loop can stop
            break;
    }
    System.out.println(s + "=" + (valida ? "válida" : "inválida"));
}

Now I use split to break the string into pieces.

In the split I use [-' ] as the delimiter, indicating that I want to separate the string on hyphens, spaces or ' (note the space between the ' and the ]). The result is an array containing the parts of the string after the split.
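To see what split actually returns for a few inputs, consider the sketch below. SplitDemo and parts are hypothetical names; the delimiter regex is the one from the answer.

```java
import java.util.Arrays;

class SplitDemo {
    // Splits on hyphen, single quote or space, as in the answer.
    static String[] parts(String s) {
        return s.split("[-' ]");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(parts("Jean-da Silva"))); // [Jean, da, Silva]
        // Only separators: every piece is empty, and split drops
        // trailing empty strings, so the array has length 0.
        System.out.println(parts("---").length);                     // 0
        // A leading separator produces an empty first element.
        System.out.println(Arrays.toString(parts("-Jean")));         // [, Jean]
        // A trailing separator is silently dropped.
        System.out.println(Arrays.toString(parts("Jean-")));         // [Jean]
    }
}
```

One consequence, as far as I can tell, is that a string such as "Jean-" would still pass the loop above, because the trailing empty piece never reaches the matcher. If that matters, passing a negative limit (s.split("[-' ]", -1)) keeps the trailing empty strings.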

Then, for each part, I check whether it matches ^[\\w&&[^_]]+$ (letters or digits, from the start to the end of the string). I keep using the option UNICODE_CHARACTER_CLASS so that \w matches accented characters, and the intersection &&[^_] so that it excludes the _.

The output is:

Jean-da Silva=válida
Carlos 2=válida
João d'lango=válida
J@ão=inválida
Carlos*=inválida
teste_1=inválida
---=inválida
' -=inválida

Another alternative would be to use the Unicode properties:

Matcher matcher = Pattern.compile("^[\\p{L}\\p{N}]+$", Pattern.UNICODE_CHARACTER_CLASS).matcher("");
// for loop with split, same as the previous code

In this case, the regex only matches characters that are letters, including accented ones (\p{L}), or numbers (\p{N}).
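Put together with the split loop, the Unicode-property variant could look like the sketch below. The wrapper method isValid and the class UnicodePropertyDemo are assumptions made for this example; the flag is kept only to mirror the snippet above, since \p{L} and \p{N} are Unicode categories either way.

```java
import java.util.regex.Pattern;

class UnicodePropertyDemo {
    // Each part must consist only of Unicode letters and numbers.
    static final Pattern PART =
        Pattern.compile("^[\\p{L}\\p{N}]+$", Pattern.UNICODE_CHARACTER_CLASS);

    static boolean isValid(String s) {
        String[] parts = s.split("[-' ]");
        if (parts.length == 0) {
            return false; // empty string or separators only
        }
        for (String part : parts) {
            if (!PART.matcher(part).find()) {
                return false; // an invalid (or empty) part was found
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isValid("Jean-da Silva"));    // true
        System.out.println(isValid("Jo\u00e3o d'lango")); // true
        System.out.println(isValid("J@\u00e3o"));         // false
        System.out.println(isValid("teste_1"));           // false
        System.out.println(isValid("---"));               // false
    }
}
```

Note that no &&[^_] intersection is needed here, because the underscore is neither a letter nor a number.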

One caveat of the options above is that they are very broad and also accept characters that you might not want.

When the option UNICODE_CHARACTER_CLASS is used, \w also matches other characters, such as those with the Join_Control property (which in this case are the ZERO WIDTH NON JOINER and the ZERO WIDTH JOINER), in addition to several others. And \p{L} matches letters from other writing systems, such as Japanese and Arabic.
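The following sketch illustrates both caveats. The class name BroadMatchDemo and its constants are made up for this example, and the sample strings are arbitrary choices.

```java
import java.util.regex.Pattern;

class BroadMatchDemo {
    static final Pattern WORD =
        Pattern.compile("^[\\w&&[^_]]+$", Pattern.UNICODE_CHARACTER_CLASS);
    static final Pattern LETTERS = Pattern.compile("^[\\p{L}\\p{N}]+$");

    public static void main(String[] args) {
        // "a", ZERO WIDTH JOINER (U+200D), "b": looks like "ab" on screen.
        String withJoiner = "a\u200Db";
        System.out.println(WORD.matcher(withJoiner).find());    // true
        System.out.println(LETTERS.matcher(withJoiner).find()); // false: U+200D is not a letter

        // Japanese and Arabic letters are letters too.
        String japanese = "\u65e5\u672c\u8a9e";          // 日本語
        String arabic = "\u0645\u0631\u062d\u0628\u0627"; // مرحبا
        System.out.println(LETTERS.matcher(japanese).find());   // true
        System.out.println(LETTERS.matcher(arabic).find());     // true
    }
}
```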

If you don’t want to be so comprehensive, an alternative is to use java.text.Normalizer together with another regex to remove the accents, and then look only for the letters a to z and the digits 0 to 9:

// CASE_INSENSITIVE option so that both upper and lower case letters match
// (v is the same array of test strings used earlier; needs java.text.Normalizer)
Matcher matcher = Pattern.compile("^[a-z0-9]+$", Pattern.CASE_INSENSITIVE).matcher("");
for (String s : v) {
    boolean valida = false;
    for (String parte : s.split("[-' ]")) {
        // remove the accents
        matcher.reset(Normalizer.normalize(parte, Normalizer.Form.NFD).replaceAll("\\p{M}", ""));
        valida = matcher.find();
        if (!valida) // an invalid part was found, so the loop can stop
            break;
    }
    System.out.println(s + "=" + (valida ? "válida" : "inválida"));
}

In short, normalization to the NFD form "breaks" an accented character in two. For example, ã is decomposed into a and the combining tilde ~. (For more details on normalization, read here, here and here.)

Next, \p{M} is used to remove the characters corresponding to accents (the combining diacritical marks). In the end only the letters remain, without the accents, and I can match them with ^[a-z0-9]+$ (thanks to the option CASE_INSENSITIVE, it covers both upper and lower case letters).
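The two steps (decompose, then drop the marks) can be seen in isolation in this sketch. The helper name stripAccents and the class StripAccentsDemo are assumptions for illustration.

```java
import java.text.Normalizer;

class StripAccentsDemo {
    // Decomposes accented characters (NFD) and removes the combining marks.
    static String stripAccents(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD).replaceAll("\\p{M}", "");
    }

    public static void main(String[] args) {
        String name = "Jo\u00e3o"; // "João", 4 chars with a precomposed ã
        String nfd = Normalizer.normalize(name, Normalizer.Form.NFD);
        // After NFD, ã is "a" followed by U+0303 (combining tilde).
        System.out.println(nfd.length());       // 5
        System.out.println(stripAccents(name)); // Joao
    }
}
```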

by Scarabelo • **331** points · Answer 1 · 2019-08-28T19:00:47+00:00

-1

The regex below accepts only letters, digits, spaces and the special characters "'" and "-":

[a-zA-Z0-9 '-]

I tested it on https://regexr.com/ and it worked. As for the Z-Z, the 0 really was missing, so I will edit. To make the regex easier to read, I changed the Z-Z to A-Z.

– Scarabelo

2019/08/28 at 19:13
1

The Z-Z on its own does not work, see

– hkotsubo

2019/08/28 at 19:16
https://regexr.com prank/

– Scarabelo

2019/08/28 at 19:17