Regex to validate a specific number of occurrences of the sequence " - "

I need to apply a regex pattern to a title. The regex is stored in the database, and the title is only accepted if it matches the pattern.

The pattern is as follows:

text - text - text

In this title the sequence " - " must occur exactly twice; if it occurs more often, the text is invalid.

The texts can contain any character except the sequence that separates them (" - "). I have tried several combinations and have not yet reached the expected result, for example:

^([\wÀ-ú\- ]+( - )[\wÀ-ú\- ]+( - )[\wÀ-ú\-]+)$
  • Try something like ^[^-]+ - [^-]+ - [^-]+$

  • I tested on https://regexr.com/ and just removing the \- part already gives the result you are looking for: ^([\wÀ-ú ]+ - [\wÀ-ú ]+ - [\wÀ-ú]+)$

  • If you are using a programming language, it is easier to split the string and check whether the result has 3 non-empty items

1 answer


You didn't say which language or program you're using. Knowing that would help, because in a programming language it is much easier to split the string and check that the resulting list has 3 non-empty entries (example). In any case, here is a solution that should work in most engines/languages.
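The split-and-check idea could look like the minimal sketch below. It assumes Java; the class name TitleSplit and the method hasThreeParts are hypothetical names chosen for illustration.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

final class TitleSplit {
    // Hypothetical helper: true when the title consists of exactly three
    // non-blank parts separated by " - ".
    static boolean hasThreeParts(String title) {
        // Pattern.quote treats the separator as literal text. The limit -1
        // keeps trailing empty strings, so "a - b - " still yields 3 items
        // (the last one empty) instead of silently dropping it.
        String[] parts = title.split(Pattern.quote(" - "), -1);
        return parts.length == 3
                && Arrays.stream(parts).noneMatch(p -> p.trim().isEmpty());
    }

    public static void main(String[] args) {
        System.out.println(hasThreeParts("text - text - text"));        // true
        System.out.println(hasThreeParts("text - text - text - text")); // false
    }
}
```

Because only the full " - " sequence is treated as a separator, single hyphens inside a part do not affect the count.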


Your regex does not work because of [\wÀ-ú\- ]. The brackets define a character class, which matches any one of the characters listed between them. Notice that the class contains \-, which stands for the hyphen. In other words, this part also matches a hyphen, which is exactly what you don't want (because of this, the regex considers texts such as - - - - - valid, see).

If you want exactly two hyphens, with no others between them, just remove the hyphen from the brackets:

^[\wÀ-ú ]+ - [\wÀ-ú ]+ - [\wÀ-ú]+$

Note that I also removed the parentheses, because they seem unnecessary here. See the difference from the previous regex.
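To compare the two expressions locally, a small sketch like the following could be used. It assumes Java's java.util.regex; the class name HyphenClassDemo is made up, and the range À-ú is written as escaped code points so the result does not depend on the source file encoding.

```java
import java.util.regex.Pattern;

final class HyphenClassDemo {
    // Regex from the question: the class still contains the hyphen.
    // U+00C0 to U+00FA is the same range as the one written with accented letters.
    static final Pattern ORIGINAL = Pattern.compile(
            "^([\\w\\u00C0-\\u00FA\\- ]+( - )[\\w\\u00C0-\\u00FA\\- ]+( - )[\\w\\u00C0-\\u00FA\\-]+)$");

    // Regex from the answer: hyphen removed from the class, parentheses dropped.
    static final Pattern FIXED = Pattern.compile(
            "^[\\w\\u00C0-\\u00FA ]+ - [\\w\\u00C0-\\u00FA ]+ - [\\w\\u00C0-\\u00FA]+$");

    static boolean matches(Pattern pattern, String text) {
        return pattern.matcher(text).matches();
    }

    public static void main(String[] args) {
        String onlyHyphens = "- - - - -";
        System.out.println(matches(ORIGINAL, onlyHyphens));          // true
        System.out.println(matches(FIXED, onlyHyphens));             // false
        System.out.println(matches(FIXED, "text - text - text"));    // true
    }
}
```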

Since the part "[\wÀ-ú ]+ - " is repeated twice, the regex could also be:

^([\wÀ-ú ]+ - ){2}[\wÀ-ú]+$

Now it makes sense to use parentheses, because everything inside them is repeated twice (as indicated by the quantifier {2}). See the regex working here.

This construction is also useful for varying the number of repetitions. If you want exactly 5 occurrences, for example, change {2} to {5}. If you want at least 2 occurrences with no upper limit, use {2,}, and if you want at least 2 and at most 5, use {2,5}.
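As a rough illustration of swapping the quantifier, the sketch below builds the same expression with different repetition counts. It assumes Java; RepetitionDemo and withQuantifier are hypothetical names, and the quantifier is passed as plain text only to keep the example short.

```java
import java.util.regex.Pattern;

final class RepetitionDemo {
    // Character classes from the answer; U+00C0 to U+00FA is the accented range.
    private static final String PART = "[\\w\\u00C0-\\u00FA ]+";
    private static final String LAST = "[\\w\\u00C0-\\u00FA]+";

    // Hypothetical helper: quantifier is text such as "{2}", "{2,}" or "{2,5}".
    static Pattern withQuantifier(String quantifier) {
        return Pattern.compile("^(" + PART + " - )" + quantifier + LAST + "$");
    }

    static boolean matches(String quantifier, String text) {
        return withQuantifier(quantifier).matcher(text).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches("{2}", "a - b - c"));      // true
        System.out.println(matches("{2}", "a - b - c - d"));  // false
        System.out.println(matches("{2,}", "a - b - c - d")); // true
    }
}
```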


Just a few details to complement:

The shortcut \w matches letters, digits and the character _. People often forget this last character, and the regex ends up matching things it shouldn't because of it (for example, the regex considers the text _ - ____ - _ valid, see). This happens because of the quantifier +, which means "one or more occurrences". That is, [\wÀ-ú ]+ matches one or more occurrences of any character that fits what is between the brackets. Since _ is matched by \w, a run of several _ is also considered valid by the regex.

Also, depending on the language/engine/program, \w may match only the letters a to z (upper and lower case) and the digits 0 to 9, or it may match all letters and digits defined by Unicode (including accented characters and letters from other alphabets, such as Japanese, Arabic, Russian, etc.). Since you did not specify which language is being used, I can't suggest how to change this (some have options/parameters/flags that make \w behave one way or the other).
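In Java, for instance, \w is ASCII-only by default and the flag Pattern.UNICODE_CHARACTER_CLASS switches it to the Unicode definition. The sketch below only illustrates that difference; the class name WordClassDemo is made up, and the sample word is written with escapes so the source encoding does not matter.

```java
import java.util.regex.Pattern;

final class WordClassDemo {
    // Default behaviour in java.util.regex: \w is [a-zA-Z_0-9].
    static final Pattern ASCII_WORD = Pattern.compile("^\\w+$");

    // With this flag, \w follows the Unicode notion of a word character.
    static final Pattern UNICODE_WORD =
            Pattern.compile("^\\w+$", Pattern.UNICODE_CHARACTER_CLASS);

    public static void main(String[] args) {
        // Four letters: a, c with cedilla, a with tilde, o.
        String accented = "a\u00E7\u00E3o";
        System.out.println(ASCII_WORD.matcher(accented).matches());   // false
        System.out.println(UNICODE_WORD.matcher(accented).matches()); // true
        // The underscore is a word character in both modes.
        System.out.println(ASCII_WORD.matcher("____").matches());     // true
    }
}
```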

If \w matches all letters defined by Unicode, you would not need the range À-ú, for example. Keep in mind that this range includes every character between the code points U+00C0 and U+00FA; you can see which ones in this list. Note that the list contains characters that are not letters, such as ÷ (DIVISION SIGN). That is, the regex considers the text ÷÷ - ÷ - ÷÷÷ valid (see).
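One possible way to be stricter, shown here only as a sketch, is to replace \w and the range with Unicode categories. This assumes Java and assumes that only letters, decimal digits and spaces are wanted, which the question does not confirm; RangeDemo and both constant names are hypothetical.

```java
import java.util.regex.Pattern;

final class RangeDemo {
    // Regex from the answer: the range U+00C0 to U+00FA also covers the
    // multiplication and division signs (U+00D7 and U+00F7).
    static final Pattern WITH_RANGE = Pattern.compile(
            "^([\\w\\u00C0-\\u00FA ]+ - ){2}[\\w\\u00C0-\\u00FA]+$");

    // Assumption: only letters (\p{L}), decimal digits (\p{Nd}) and spaces count.
    static final Pattern LETTERS_AND_DIGITS = Pattern.compile(
            "^([\\p{L}\\p{Nd} ]+ - ){2}[\\p{L}\\p{Nd}]+$");

    public static void main(String[] args) {
        // Division signs only, in the shape "xx - x - xxx".
        String divisionSigns = "\u00F7\u00F7 - \u00F7 - \u00F7\u00F7\u00F7";
        System.out.println(WITH_RANGE.matcher(divisionSigns).matches());         // true
        System.out.println(LETTERS_AND_DIGITS.matcher(divisionSigns).matches()); // false
        System.out.println(LETTERS_AND_DIGITS.matcher("_ - ____ - _").matches()); // false
    }
}
```

The category-based version also rejects the underscore-only text, since _ is neither a letter nor a digit.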

To learn more about what a code point is, see this question.

Another detail is that [\wÀ-ú\- ]+ also contains a space before the ], which means that a text with nothing but spaces around the two hyphens, followed by x, is also considered valid (see).

So if you want to improve the regex and avoid these false positives, you have to be more specific. Or, if your input is controlled (for example, if the text is generated by some process that guarantees there will always be something meaningful between the hyphens and that the strange cases mentioned above will "never" occur), then it's fine to keep the regex simpler.

If so, you can simplify further and use a negated character class to match anything that is not a hyphen:

^([^-]+ - ){2}[^- ]+$

Here, [^-] is "any character other than a hyphen", and [^- ] is "any character that is neither a hyphen nor a space" (since the last part did not allow spaces).

Keep in mind that the first option does not match line breaks, whereas this one does (see). If you want to avoid matching line breaks, just include them in the negated character class:

^([^-\n\r]+ - ){2}[^-\s]+$

The first class, [^-\n\r], excludes the hyphen and line breaks, while the second uses the shortcut \s, which matches spaces and line breaks (plus several other characters; the exact list varies with the language/engine/program used). Thus, the regex no longer matches line breaks between the hyphens (see).
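The difference between the two negated classes can be checked with a sketch like this one, assuming Java; the class name NegatedClassDemo and the constant names are made up for illustration.

```java
import java.util.regex.Pattern;

final class NegatedClassDemo {
    // Anything except a hyphen, which includes line breaks.
    static final Pattern ANY_BUT_HYPHEN =
            Pattern.compile("^([^-]+ - ){2}[^- ]+$");

    // Hyphens and line breaks excluded; the last part also excludes whitespace.
    static final Pattern NO_LINE_BREAKS =
            Pattern.compile("^([^-\\n\\r]+ - ){2}[^-\\s]+$");

    public static void main(String[] args) {
        String withLineBreak = "a\nb - c - d";
        System.out.println(ANY_BUT_HYPHEN.matcher(withLineBreak).matches()); // true
        System.out.println(NO_LINE_BREAKS.matcher(withLineBreak).matches()); // false
        System.out.println(NO_LINE_BREAKS.matcher("a - b - c").matches());   // true
    }
}
```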


Anyway, one could speculate about different options for accepting or rejecting certain cases, but since you did not specify what the text looks like (whether it always starts with a letter, whether it can contain digits and, if so, how many and in which positions, how many spaces there are between words, etc.), I believe this is already enough for you to get started.

You said that between the hyphens there can be "any character", but is it really any character? Unicode currently defines more than 130,000 characters, and I doubt that you really want to accept them all, since not all of them make sense in every context. Depending on what your texts look like and how willing you are to deal with false positives (if they occur), you can adjust the regex to make it more specific.

  • Thank you very much for the answer, it helped a lot. One detail I added to the question in an edit that has not been accepted yet: other hyphens may occur in the middle of the text; what cannot occur again is the sequence space, hyphen, space ( - ). So something like text-pre - text - text-post can occur. That is the point of the question about the repeated hyphens. Is there any way to do this? I am using Java, but I would like to solve everything in the regex, because there are other patterns.

  • @giovannijakubiak It depends a lot on what the text can contain (is the hyphen always between letters? can there be other characters? etc.). For example, if it is always letters and the hyphen always has a letter before and after it, an alternative is https://regex101.com/r/jM2CPS/1/ (keeping in mind that in this case I am using \w in Unicode mode, i.e., it already matches accented letters)

  • @giovannijakubiak If you are using Java, I still find it easier to do texto.split(" - ") and check whether the result is an array with 3 elements (and for each element you check whether it is a valid text, possibly with a simpler regex). If that is the case, you can even ask another question with more specific examples, stating exactly the rules that define a valid text, etc.

  • regex101.com/r/jM2CPS/1 is the ideal solution. I changed one of its alternatives and arrived at this solution https://regex101.com/r/JkPoJr/4/ but there I disregard spaces. Thank you so much for your help, you saved the call! About Java, I want to keep the code very generic because of the other existing regex patterns, so it would not be ideal to put the specifics of certain titles in the code. Thanks again!
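For the follow-up requirement (single hyphens allowed inside a part, but " - " allowed only twice), one regex-only possibility is a negative lookahead that stops each part before the next separator. This is a sketch under that assumption, not the expression behind the regex101 links above; it assumes Java, and SeparatorLookaheadDemo is a made-up name.

```java
import java.util.regex.Pattern;

final class SeparatorLookaheadDemo {
    // One part: one or more characters, none of which begins a " - " separator.
    private static final String PART = "(?:(?! - ).)+";

    // Three parts joined by exactly two " - " separators.
    static final Pattern TITLE =
            Pattern.compile("^" + PART + " - " + PART + " - " + PART + "$");

    static boolean isValid(String title) {
        return TITLE.matcher(title).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("text-pre - text - text-post")); // true
        System.out.println(isValid("text - text - text - text"));   // false
        System.out.println(isValid("text - text"));                 // false
    }
}
```

Since the dot does not match line breaks by default in Java, titles spanning several lines are rejected as well.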
