Regex to validate a specific number of occurrences of the sequence " - "

Question


Asked 5 years, 9 months ago

Viewed 226 times


I need to apply a regex pattern to a title. The regex is stored in the database, and the title is only accepted if it matches the pattern.

The pattern is as follows:

text - text - text

The sequence " - " must occur exactly twice in the title; if it occurs more often, the title is invalid.

The texts can contain any character except the sequence that separates them (" - "). I have tried several combinations and have not yet reached the expected result, for example:

^([\wÀ-ú\- ]+( - )[\wÀ-ú\- ]+( - )[\wÀ-ú\-]+)$

Try something like ^[^-]+ - [^-]+ - [^-]+$

– Woss

2019/09/24 at 20:49
I tested on https://regexr.com/ and just removing the \- part already gives the result you are looking for: ^([\wÀ-ú ]+ - [\wÀ-ú ]+ - [\wÀ-ú]+)$

– Penachia

2019/09/24 at 20:52

If you are using some programming language, it is easier to split and check if the result has 3 non-empty items

– hkotsubo

2019/09/24 at 20:58

1 answer


by hkotsubo • **55,826** points · Answer 1 · 2019-09-25T00:42:57+00:00

You didn't say which language or program you're using. Knowing that would help, because in a programming language it is much easier to split the string and check that the result has 3 non-empty items. In any case, here is a solution that should work in most engines/languages.
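As a rough illustration of the split approach, here is a minimal Python sketch. The question does not say which language is in use, so Python and the function name is_valid_title are assumptions made only for this example.

```python
def is_valid_title(title: str, separator: str = " - ") -> bool:
    """Return True if the title has exactly three non-blank parts.

    Hypothetical helper: the name and the default separator are
    assumptions for this sketch, not something from the question.
    """
    parts = title.split(separator)
    # Exactly two separators produce exactly three parts.
    # A part made only of whitespace is treated as empty.
    return len(parts) == 3 and all(part.strip() for part in parts)


for candidate in ["text - text - text", "a - b - c - d", "a - b", "a -  - c"]:
    print(repr(candidate), is_valid_title(candidate))
```

Unlike the regex, this version says nothing about which characters each part may contain; it only counts the separators and rejects blank parts.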

Your regex does not work because of [\wÀ-ú\- ]. The brackets define a character class, which matches any single character listed between them. Notice that the class contains \-, which corresponds to the hyphen. That is, this part also matches hyphens, which is exactly what you don't want. Because of this, the regex considers texts such as - - - - - valid.

If you want exactly two hyphens, with no other hyphen between them, just remove the hyphen from the brackets:

^[\wÀ-ú ]+ - [\wÀ-ú ]+ - [\wÀ-ú]+$

Note that I also removed the parentheses, because they seem unnecessary here.

Since the part [\wÀ-ú ]+ - is repeated twice, the regex could also be:
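To see the difference between the two expressions, here is a small Python sketch (Python is an assumption here, since the question names no language). It runs the regex from the question and the version without the hyphen in the class against a valid title and a title with three separators.

```python
import re

# Regex from the question: the hyphen inside the class lets each
# "text" part swallow extra " - " separators.
ORIGINAL = re.compile(r"^([\wÀ-ú\- ]+( - )[\wÀ-ú\- ]+( - )[\wÀ-ú\-]+)$")

# Same idea with the hyphen removed from the character classes.
FIXED = re.compile(r"^[\wÀ-ú ]+ - [\wÀ-ú ]+ - [\wÀ-ú]+$")

for title in ["text - text - text", "a - b - c - d"]:
    original_ok = bool(ORIGINAL.match(title))
    fixed_ok = bool(FIXED.match(title))
    print(f"{title!r}: original={original_ok}, fixed={fixed_ok}")
```

The title with three separators is accepted by the first regex and rejected by the second, which is the behavior the question asks for.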

^([\wÀ-ú ]+ - ){2}[\wÀ-ú]+$

Now the parentheses make sense, because everything inside them is repeated twice (as indicated by the quantifier {2}).

This construction is also useful for varying the number of repetitions. If you want exactly 5 occurrences, for example, change {2} to {5}. If you want at least 2 occurrences with no upper limit, use {2,}, and if you want at least 2 and at most 5, use {2,5}.
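The quantifier variants can be compared side by side. This is a sketch in Python (an assumed language), and the constant names are made up for the example.

```python
import re

# Hypothetical names; only the quantifier differs between the patterns.
EXACTLY_TWO = re.compile(r"^([\wÀ-ú ]+ - ){2}[\wÀ-ú]+$")
AT_LEAST_TWO = re.compile(r"^([\wÀ-ú ]+ - ){2,}[\wÀ-ú]+$")
TWO_TO_FIVE = re.compile(r"^([\wÀ-ú ]+ - ){2,5}[\wÀ-ú]+$")

samples = [
    "a - b",                      # 1 separator
    "a - b - c",                  # 2 separators
    "a - b - c - d",              # 3 separators
    "a - b - c - d - e - f - g",  # 6 separators
]

for text in samples:
    print(
        repr(text),
        bool(EXACTLY_TWO.match(text)),
        bool(AT_LEAST_TWO.match(text)),
        bool(TWO_TO_FIVE.match(text)),
    )
```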

Just a few details to complement:

The shortcut \w corresponds to letters, digits and the character _. People often forget this last character, and the regex ends up matching things it shouldn't because of it (for example, the regex considers the text _ - ____ - _ valid). This happens because of the quantifier +, which means "one or more occurrences". That is, [\wÀ-ú ]+ matches one or more occurrences of any character that fits what is between the brackets. Since _ is covered by \w, a run of several _ is also considered valid by the regex.

Also, depending on the language/engine/program, \w may match only the letters a to z (upper and lower case) and the digits 0 to 9, or it may match all letters and digits defined by Unicode (including accented characters and letters from other alphabets, such as Japanese, Arabic, Russian, etc.). Since you did not specify which language is being used, it is not possible to suggest how to change this (some have options/parameters/flags that make \w behave one way or the other).

If \w matches all letters defined by Unicode, you would not need the range À-ú, for example. Keep in mind that this range includes all characters between the code points U+00C0 and U+00FA, and that some of them are not letters, such as ÷ (DIVISION SIGN). That is, the regex considers the text ÷÷ - ÷ - ÷÷÷ valid.

To learn more about what a code point is, see this question.

Another detail is that [\wÀ-ú\- ]+ also has a space before the ], which means that a text whose parts between the " - " separators consist only of spaces (with, say, just an x at the end) is also considered valid.
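The false positives described above can be reproduced with a short Python sketch. Python is an assumption here; in Python 3, \w matches Unicode letters by default for str patterns, and the re.ASCII flag restricts it to a-z, A-Z, 0-9 and _. Other engines may behave differently.

```python
import re

# The version of the regex without the hyphen in the character classes.
FIXED = re.compile(r"^[\wÀ-ú ]+ - [\wÀ-ú ]+ - [\wÀ-ú]+$")

underscores = "_ - ____ - _"               # "_" is part of \w
division_signs = "÷÷ - ÷ - ÷÷÷"            # "÷" (U+00F7) is inside À-ú
only_spaces = " - ".join([" ", " ", "x"])  # blank first and second parts

for text in [underscores, division_signs, only_spaces]:
    print(repr(text), bool(FIXED.match(text)))

# How \w treats accented letters depends on the flags in use.
unicode_word = bool(re.fullmatch(r"\w+", "ação"))
ascii_word = bool(re.fullmatch(r"\w+", "ação", flags=re.ASCII))
print("Unicode \\w:", unicode_word, "ASCII \\w:", ascii_word)
```

All three sample texts are accepted, which shows why a stricter character class is needed when the input is not controlled.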

So if you want to improve the regex and avoid these false positives, you have to be more specific. Or, if your input is controlled (for example, if the text is generated by some process that ensures there will always be something meaningful between the hyphens and the strange cases above will "never" occur), then it is fine to keep the regex simpler.

In that case, you can simplify further and use a negated character class to match anything that is not a hyphen:

^([^-]+ - ){2}[^- ]+$

Here, [^-] is "any character other than a hyphen", and [^- ] is "any character that is neither a hyphen nor a space" (since the last part did not allow spaces).

Keep in mind that the first option does not match line breaks, whereas this one does. If you want to avoid matching line breaks, just include them in the negated character class:

^([^-\n\r]+ - ){2}[^-\s]+$

The first class, [^-\n\r], excludes the hyphen and line breaks, while the second uses the shortcut \s, which matches spaces and line breaks (in addition to several other characters, but the exact list varies according to the language/engine/program used). Thus, the regex no longer matches line breaks between the hyphens.
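The effect of excluding line breaks can be checked with a small Python sketch (again, Python is only an assumption, and the constant names are hypothetical).

```python
import re

# Negated class that still allows line breaks inside each part.
ALLOWS_NEWLINES = re.compile(r"^([^-]+ - ){2}[^- ]+$")

# Same structure, with line breaks excluded from every part.
NO_NEWLINES = re.compile(r"^([^-\n\r]+ - ){2}[^-\s]+$")

samples = ["text - text - text", "a\nb - c - d", "a - b - c - d"]

for text in samples:
    print(
        repr(text),
        bool(ALLOWS_NEWLINES.match(text)),
        bool(NO_NEWLINES.match(text)),
    )
```

Only the second sample distinguishes the two patterns: the first regex accepts the embedded line break and the second rejects it.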

Anyway, one could speculate about many different options to cover or exclude certain cases, but since you did not specify what the text looks like (whether it always starts with a letter, whether it can have digits and, if so, how many and in which positions, how many spaces there are between words, etc.), I believe this is enough for you to get started.

You said the texts between the hyphens can contain "any character", but is it really any character? Unicode currently defines more than 130,000 characters, and I doubt that you really want to accept them all, since not all of them make sense in every context. Depending on what your texts look like and how willing you are to deal with false positives (if they occur), you can adjust the regex to make it more specific.