How to identify words that start with a character and end with the same plus a space using regex?

Asked

Viewed 40 times

1

inserir a descrição da imagem aqui

/(\*[ ]|\_|\~|\@\@|\%\%)([^\s].*?)\1/gm

When I add the expression *[ ] (highlighted in red) to consider the space after the * it starts to consider everywhere and I need it to be only at the end of the word.

So what I need to do in regex is that it be considered * followed by any character at the beginning and ? with space only at the end.

In green I highlighted the words I want to capture with the expression.

I’m trying to perform this operation on JS.

  • Please click on [Edit] and put the code (in this case, regex) as text. Putting it as an image is not ideal, understand the reasons reading the FAQ. Taking advantage, it would be good to also put the language or tool/engine you are using, because each one implements regex in one way and what works for one may not work for another

  • Done, @hkotsubo! Thanks for the guidance

  • In fact, the regex has things that don’t seem to have anything to do with the case: \*[ ] takes an asterisk followed by space (that’s why it takes the penultimate case), and has other things not included in the description of the problem, such as the \_ and \~, for example (it looks like there’s too much stuff in there) - Between the asterisks can you have anything? Or just "words" (letters and/or digits, etc.). If anything, then there could be other asterisks and/or spaces, punctuation marks, etc... depending on what you need, the regex will be different...

  • 1

    Use '(\*).+\1\s', if the engine supports backreference: https://www.regular-expressions.info/backref.html

  • Although in that case, I think that it would be sufficient to remove the space after the asterisk, that is, remove the [ ] and leave only \* - Ah, the first test case has no room after the second *, (or has it and I haven’t seen it? ) then it should be caught by regex?

  • Or else /(\*|\_|\~|\@\@|\%\%)([^\s].*?)\1 /gm (asterisk without space after, followed by several characters, ends with asterisk and space (note that you have a space before the last bar). As for the gm in the end, it depends on how you’re doing the search (g serves to bring all occurrences, is useful if the string has more than one and you want all, and m serves to change the behavior of ^ and $, but since regex doesn’t have that, then m is of no use in this case)

  • 1

    @Lucas Essa regex also picks up the penultimate case (* Teste*), but I understand he only wants the ones marked in green...

  • Guys, thank you so much for the help! With the guidance of you I managed to evolve here, thank you!

Show 3 more comments

1 answer

2


It depends on what you need, but from the reported cases, the rules seem to be:

  • after the first asterisk can not have space (i.e., it is not "any character" as stated in the question)
  • after the second asterisk has to have a space

So it would be something like:

/(\*)[^\s][^*]*\* /

That is, it has the asterisk, then it has a character that is not space ([^\s]), then has zero or more characters other than an asterisk ([^*]*), and then has the asterisk itself followed by space (note that there is a space before the second bar).

Although it won’t take the first case (*Teste*), because it seems that there is no space after the second asterisk. In this case, if you want to also catch asterisk at the end, it would have to be:

/(\*)[^\s][^*]*(?<! )\*( |$)/

In the end we have ( |$) (a space or the end of the string), but before the second asterisk also includes a negative lookbehind to ensure that before he has a space (otherwise he would also take *Teste *, but you implied that it’s not to take this case).


And of course this is only for the asterisk, but your regex is considering several other cases, such as _, ~, @@ and %%, then I would be:

/(\*|_|~|@@|%%)([^\s].*?)\1 /

Or, if you want to consider the case where the second occurrence may be at the end (and not necessarily followed by space), in addition to not having space before the second occurrence:

/(\*|_|~|@@|%%)([^\s].*?)(?<! )\1( |$)/

Ex:

var regex = /(\*|_|~|@@|%%)([^\s].*?)(?<! )\1( |$)/;
var textos = ['*Teste*', '*Teste *', '*Teste*ok', '* Teste* ok', '*Teste* ok', '@@teste@@'];
for (var texto of textos) {
    if (regex.test(texto)) {
        console.log(texto);
    }
}


Remembering that the flags gm that you used may or may not make a difference:

  • the g fetches all occurrences (ie if the string has more than one will bring all). If you just want to catch an occurrence, the g is not necessary
  • the m changes the behavior of $: the normal is that it indicates only the end of the string, but with the m, the $ also indicates the end of a line

Finally, the short cut \s not only picks up spaces, as it can also consider other things like TAB and line breaks (see here for more details). If you only want spaces, exchange it for a single space.

Browser other questions tagged

You are not signed in. Login or sign up in order to post.