String handling with Regex

Question

Asked 4 years, 6 months ago

I have the following string:

const string = 'Isto é uma frase $[var_test["aaaa"].bbb], mais coisas $[var_test["ccc"].ddd].eee.fff mais coisas.'

The dynamic parts are: var_test | aaaa | bbb | ccc | ddd | eee | fff | Isto é uma frase | , mais coisas | mais coisas.

What is the best way to produce the following output?

Isto é uma frase $[var_test["aaaa"].bbb], mais coisas $[var_test["ccc"].ddd.eee.fff] mais coisas

I thought about using a regex and wrote the one below, but I must be missing something: it matches $[var_test["aaaa"].bbb], mais coisas $[var_test["ccc"]. when it should match $[var_test["ccc"].ddd].eee.fff.

var str = 'Isto é uma frase $[var_test["aaaa"].bbb], mais coisas $[var_test["ccc"].ddd].eee.fff mais coisas';
var patt = new RegExp(/\$\[.*?\[.*?].*?]\..*?\b/);

Can someone give me some tips on the best way to handle this kind of string manipulation?

2 answers

One thing to keep in mind with regular expressions: when we tell the regex to take any character zero or more times (.*), it does so for special characters too. If we put a delimiter at the end of the expression, that delimiter is itself covered by "any character", so the match can run past the delimiter we had in mind and keep going until the rest of the pattern fits. In your case the match starts at the first $[ and only stops at the first ] that is followed by a dot.
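To see this behaviour concretely, the sketch below runs the pattern from the question against the question's sample string. The variable names are only illustrative.

```javascript
// Sample string and pattern taken from the question.
const str = 'Isto é uma frase $[var_test["aaaa"].bbb], mais coisas $[var_test["ccc"].ddd].eee.fff mais coisas';
const lazyPattern = /\$\[.*?\[.*?].*?]\..*?\b/;

// exec returns the first match (or null); index 0 holds the matched text.
const found = lazyPattern.exec(str);
console.log(found[0]);
// prints: $[var_test["aaaa"].bbb], mais coisas $[var_test["ccc"].
//
// The "]" after bbb is followed by "," and not ".", so the third ".*?"
// keeps consuming characters (spaces included) until it reaches the "]."
// that follows "ccc".
```

Even a lazy quantifier expands as far as it must for the rest of the pattern to match, which is why restricting the allowed characters is more reliable here than switching between .* and .*?.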

I don’t know every form your string can take, and regex problems usually have more than one solution. I will suggest one that assumes the text you want to match never contains a space.

In a regular expression, to say "any character not in this set", put a caret (circumflex) at the start of a bracketed set, as in [^ ]. This matches any single character except those listed in the set.

Using this idea, we can replace each .*? in your expression with [^ ]*, giving the code below:

var patt = new RegExp(/\$\[[^ ]*\[[^ ]*\][^ ]*\]\.[^ ]*/);
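As a quick check, here is a small sketch that applies this pattern to the string from the question. The variable names are illustrative, and the regex literal is used directly instead of wrapping it in new RegExp.

```javascript
// Sample string from the question.
const str = 'Isto é uma frase $[var_test["aaaa"].bbb], mais coisas $[var_test["ccc"].ddd].eee.fff mais coisas';

// Same pattern as above: every ".*?" replaced by "[^ ]*" (anything except a space).
const noSpacePattern = /\$\[[^ ]*\[[^ ]*\][^ ]*\]\.[^ ]*/;

const result = str.match(noSpacePattern);
console.log(result[0]);
// prints: $[var_test["ccc"].ddd].eee.fff
//
// The first "$[...]" is skipped: its closing "]" is followed by "," and the
// pattern can no longer cross the space to reach a later "].".
```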

I also added the backslash that was missing before the closing brackets (\]).

If you want a list of all the codes in your string that start with $[, each running from that start up to the last character that is neither a space nor a comma, use this expression:

var patt = new RegExp(/\$\[[^ ]*\]\.[^ ,]*/g);
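A minimal sketch of how that list could be collected, assuming the question's sample string. With the g flag, String.prototype.match returns an array of every matched substring.

```javascript
// Sample string from the question.
const str = 'Isto é uma frase $[var_test["aaaa"].bbb], mais coisas $[var_test["ccc"].ddd].eee.fff mais coisas';

// Global pattern: "$[", non-space characters, "].", then anything except space or comma.
const allCodesPattern = /\$\[[^ ]*\]\.[^ ,]*/g;

// With the g flag, match() returns all matches (or null when there are none).
const codes = str.match(allCodesPattern) || [];
console.log(codes);
// prints: [ '$[var_test["aaaa"].bbb]', '$[var_test["ccc"].ddd].eee.fff' ]
```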

To learn more about regular expression rules in JavaScript, see here.

Thank you very much for the information, it helped me understand. As a solution I used the following regex: test.replace(/(\$\[\w*\["[\w-]*"\]\.\w*)(\])(\.[\w. ]*)/g, "$1$3$2"); Now there is one remaining case: if the last character is a ., I want to discard it. How could I exclude it? (Example: https://jsfiddle.net/85unb1hy/)

– Cláudio Hilário

2020/05/05 at 01:18

However, I have already solved the . problem with the following regex: (\$\[\w*\["[\w-]*"\]\.\w*)(\])(\.[\w.]*[^ .])

– Cláudio Hilário

2020/05/05 at 01:22

Here is the full working example: https://jsfiddle.net/85unb1hy/1/ Thank you very much

– Cláudio Hilário

2020/05/05 at 01:24

by hkotsubo • **55,826** points · Answer 1 · 2020-05-05T11:49:17+00:00

I would like to suggest some improvements to the solution in the other answer (and also to the one you posted in the comments).

In many places you use the quantifier *, which means "zero or more occurrences". This means the replacement is made even when there is no character at all in that position. For example, the excerpt $[[""].].x will be replaced by $[[""]..x].

The final part of the regex also accepts things like $[a["b"].].....fff (which would be replaced by $[a["b"]......fff]). That is because it uses [\w\.]* (zero or more characters that are \w or a dot, so it also accepts several dots in a row).
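The two false positives can be reproduced with a short sketch. The regex here is an assumption: it is the pattern from the comments with the backslashes and * quantifiers written out as the description above implies, so treat it as an illustration rather than the exact expression that was used.

```javascript
// Assumed reconstruction of the "*"-based regex from the comments.
const loosePattern = /(\$\[\w*\["[\w-]*"\]\.\w*)(\])(\.[\w.]*)/g;

// Moves the "]" (group 2) after the dotted tail (group 3).
const moveBracket = (text) => text.replace(loosePattern, "$1$3$2");

// Every \w* and [\w-]* can match nothing, so empty names are accepted.
const emptyNames = moveBracket('$[[""].].x');
console.log(emptyNames); // prints: $[[""]..x]

// [\w.]* accepts any run of dots, so consecutive dots are accepted too.
const manyDots = moveBracket('$[a["b"].].....fff');
console.log(manyDots); // prints: $[a["b"]......fff]
```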

Of course, if the string only has valid entries and there is no chance of having these false positives, then it is okay to use the regex that was suggested. But if you want to be more precise, you can make some changes.

The first is to change the * to +, meaning "one or more occurrences" (for example, \w* would become \w+). This makes at least one character mandatory. If you want to be even more precise, you can use other quantifiers, for example \w{3,} (at least 3 occurrences of \w), or \w{3,10} (at least 3, at most 10). Adjust the values to whatever makes the most sense for your case.
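The difference between these quantifiers is easy to check in isolation. In the sketch below the ^ and $ anchors are my addition, used only so that each test applies to the whole sample string.

```javascript
// Each pattern must match the entire string because of the ^ and $ anchors.
const zeroOrMore = /^\w*$/;
const oneOrMore = /^\w+$/;
const atLeastThree = /^\w{3,}$/;
const threeToTen = /^\w{3,10}$/;

console.log(zeroOrMore.test(""));             // true: zero characters are allowed
console.log(oneOrMore.test(""));              // false: at least one is required
console.log(atLeastThree.test("ab"));         // false: only 2 characters
console.log(atLeastThree.test("abc"));        // true
console.log(threeToTen.test("abcdefghijk"));  // false: 11 characters is over the limit
```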

And for the final stretch you can use ((?:\.\w+)+). The idea is that the sequence \.\w+ (a dot followed by one or more letters, digits or _) repeats one or more times. This avoids cases like ....., and the [^ .] at the end is no longer needed, because the regex already guarantees that the match cannot end in a dot or a space. I also put the repeated part inside a non-capturing group (delimited by (?: and )), so these parentheses do not create another group and do not interfere with the numbering used in the replacement (the $1, $2, etc. that you used).
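Here is a small sketch of that tail expression on its own, again with ^ and $ anchors added by me for the sake of the test. It also shows why the inner group is non-capturing: a capturing inner group would add an extra entry to the match array and shift the group numbers.

```javascript
// The dotted tail: one or more repetitions of "a dot followed by \w+".
const tail = /^((?:\.\w+)+)$/;

console.log(tail.test(".eee.fff")); // true
console.log(tail.test(".....fff")); // false: a dot must be followed by \w+
console.log(tail.test(".eee."));    // false: cannot end in a dot

// Non-capturing inner group: the match array holds the full match plus one group.
const withNonCapturing = ".eee.fff".match(tail);
console.log(withNonCapturing.length); // 2
console.log(withNonCapturing[1]);     // .eee.fff

// Capturing inner group: an extra group appears (holding the last repetition).
const withCapturing = ".eee.fff".match(/^((\.\w+)+)$/);
console.log(withCapturing.length); // 3
console.log(withCapturing[2]);     // .fff
```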

Anyway, it would look like this:

const str = 'Isto é uma frase $[var_test["aaaa"].bbb], esse não $[[""].x].y mais coisas $[var_test["ccc"].ddd].eee.fff mais coisas $[var_test["ccc"].ddd].eee.fff.';

const newStr = str.replace(/(\$\[\w+\[\"[\w-]+\"\]\.\w+)(\])((?:\.\w+)+)/g, "$1$3$2");
console.log(newStr);

You can improve it further. In the middle you used [\w-] (a \w or a hyphen), so \"[\w-]+\" will accept things like "-----". If the idea is to accept only words separated by single hyphens (such as "abc-def" or "abc-def-ghi", or no hyphen at all, such as "abc", but not "abc--def", "-abc", "abc-" or "----"), then change this excerpt to \"\w+(?:-\w+)*\" (between the quotation marks we have one or more \w, followed by zero or more occurrences of "a hyphen followed by \w+").
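The stricter quoted-name rule can be checked with the sketch below. The anchored quotedName pattern and the combined strictPattern are my own assembly of the pieces described in this answer, so take them as one possible way to put it together.

```javascript
// Quoted name: words separated by single hyphens (anchors added for testing).
const quotedName = /^"\w+(?:-\w+)*"$/;

const accepted = ['"abc"', '"abc-def"', '"abc-def-ghi"'].map((s) => quotedName.test(s));
const rejected = ['"abc--def"', '"-abc"', '"abc-"', '"----"'].map((s) => quotedName.test(s));
console.log(accepted); // [ true, true, true ]
console.log(rejected); // [ false, false, false, false ]

// The same rule dropped into the full replacement regex. The added group is
// non-capturing, so $1, $2 and $3 keep their meaning.
const strictPattern = /(\$\[\w+\["\w+(?:-\w+)*"\]\.\w+)(\])((?:\.\w+)+)/g;
const sample = 'mais coisas $[var_test["ccc"].ddd].eee.fff mais coisas $[x["a--b"].c].d';
const fixed = sample.replace(strictPattern, "$1$3$2");
console.log(fixed);
// prints: mais coisas $[var_test["ccc"].ddd.eee.fff] mais coisas $[x["a--b"].c].d
```

The second code in the sample is left untouched because "a--b" contains two hyphens in a row.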

Anyway, regex is like that: the more precise and specific it is, the more complicated it gets. It is up to you to find the balance between accuracy (fewer false positives) and clarity and ease of maintenance. As a general rule, it is important to state clearly what you want the regex to match and what you do not (for example, "I want a dot followed by at least one character, and I do not want two or more dots in a row", as I did above). On the other hand, if the input is controlled and you know that cases like the ones mentioned never occur, then you would not need to change anything.