How to hyphenate between string characters


0

Separating the characters of a string with a hyphen works perfectly in Python this way:

import re

regex = r"\B(?=(.{1}))"

test_str = "Pêssego"

subst = "-"

result = re.sub(regex, subst, test_str, 0)

if result:
    print(result)  # P-ê-s-s-e-g-o

But not in JavaScript:

const regex = /\B(?=(.{1}))/g;
const str = `Pêssego`;
const subst = `-`;

const result = str.replace(regex, subst);

console.log(result); // Pês-s-e-g-o

Should I add something? Take something out? What is the difference between the two? Is there any other way to separate a string with hyphens in real time in an input field?

  • 4

    console.log("Pêssego".split("").join("-"))

  • 2

    There is a way without regex to do: 'Pêssego'.split('').join('-');

  • 3

    Or use the iteration protocol implemented by the strings: [...'Pêssego']. :-)

  • @Augustovasques Good idea. With this method, would it be possible to make the change while typing? With regex I would have something like this in a keyup function: this.value = this.value.replace(/ /g,''); var string = this.value; this.value = string.replace(/\B(?=(.{1}))/g, "-");

  • 2

    For the record, the regex could be just \B(?=.), since the {1} is redundant: in general, (anything){1} is the same as (anything)

1 answer

3


Although the expression is the same, you have just run into a fundamental feature of regex: it does not work exactly the same way in every language, engine or tool. The "basic" behavior is the same (or at least similar), but each one implements it in its own way and, as the saying goes, "the devil is in the details".

In Python, by default, regex are Unicode-aware for str patterns (in short, they go "beyond ASCII"). This means that shorthands such as \w, \b and even the . work with Unicode code points, and take accented letters and other alphabets into account.

But in JavaScript, many shorthands work only with ASCII and ignore accented characters. This is the case for \w (which means "the letters a to z (upper and lower case), the digits 0 to 9 and the character _", with no accented letters). And since \b and \B mark positions in the string according to whether the characters before and after are alphanumeric or not (see detailed explanation here), they do not treat accented letters as letters either (in Python, \w matches any letter of any alphabet, including accented ones, so \b and \B also work in these cases). In "Pêssego", JavaScript therefore sees the ê as a non-alphanumeric character, so the positions right before and right after it are word boundaries (\b) rather than \B, and no hyphen is inserted there.
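
To make the paragraph above concrete, here is a small sketch that lists the positions where \B matches in "Pêssego" in JavaScript. It assumes the ê is a single precomposed character (written as \u00EA so the encoding is unambiguous); the variable names are only for illustration.

```javascript
// "P\u00EAssego" is "Pêssego" with a precomposed ê (one UTF-16 code unit).
const word = "P\u00EAssego";

// Without the u flag, \w is ASCII-only, so ê is not a "word" character.
console.log(/\w/.test("\u00EA")); // false

// Indexes where \B matches: only the positions between two ASCII letters.
const nonBoundaries = [...word.matchAll(/\B/g)].map((m) => m.index);
console.log(nonBoundaries); // [ 3, 4, 5, 6 ]

// Hence hyphens only appear from the first "s" onward.
console.log(word.replace(/\B(?=.)/g, "-")); // Pês-s-e-g-o
```

Positions 1 and 2 (on either side of the ê) are missing from the list, which is exactly where the hyphens are missing in the output.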


The comments suggested alternatives that do this without regex, but there is a catch: the result is only the same when the string contains only alphanumeric characters. See the example below:

function test(str) {
    const regex = /\B(?=(.{1}))/g;
    const subst = '-';
    console.log(`------------\nTesting: '${str}'`);
    // using your regex
    console.log(`regex: ${str.replace(regex, subst)}`);
    // using split (as suggested in the comments)
    console.log(`split: ${str.split("").join("-")}`);
}

test('Pêssego');
test('Oi, tudo bem?');

The result is:

------------
Testing: 'Pêssego'
regex: Pês-s-e-g-o
split: P-ê-s-s-e-g-o
------------
Testing: 'Oi, tudo bem?'
regex: O-i,- t-u-d-o b-e-m?
split: O-i-,- -t-u-d-o- -b-e-m-?

Note that when there are non-alphanumeric characters (such as spaces and punctuation marks), your regex only places the hyphen between two alphanumeric characters or between two non-alphanumeric characters (note that one was placed between the comma and the space right after it). This is exactly what \B indicates: a position in the string where the characters before and after are of the same "type" (either both are alphanumeric, or both are not). But when a single space separates two words, no hyphen is placed next to it, and the same goes for the end: no hyphen is placed before the ?, because at that position there is a letter before and a non-letter after.

But using split, a hyphen is placed between all characters, regardless of whether they are alphanumeric or not.


Anyway, if you want to hyphenate between all characters of the string, regardless of whether they are alphanumeric or not, use split/join (or use the spread operator: [...str].join("-")).
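
One detail worth knowing when choosing between the two: split("") cuts the string into UTF-16 code units, while the spread operator iterates by code point. For text like "Pêssego" they give the same result, but they differ for characters outside the Basic Multilingual Plane, such as emoji. A minimal sketch, with helper names made up for this example:

```javascript
// Hypothetical helper names, introduced only for this sketch.
const hyphenateByCodeUnit = (str) => str.split("").join("-");
const hyphenateByCodePoint = (str) => [...str].join("-");

// Same result for ordinary accented text (ê written as \u00EA).
console.log(hyphenateByCodeUnit("P\u00EAssego"));  // P-ê-s-s-e-g-o
console.log(hyphenateByCodePoint("P\u00EAssego")); // P-ê-s-s-e-g-o

// "\u{1F351}" is an emoji stored as two UTF-16 code units (a surrogate pair).
const withEmoji = "a\u{1F351}b";
console.log(hyphenateByCodePoint(withEmoji) === "a-\u{1F351}-b"); // true
console.log(hyphenateByCodeUnit(withEmoji) === "a-\u{1F351}-b");  // false: the pair gets split
```

If the input field may receive emoji, the spread version is probably the safer of the two.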

But if the idea is to put the hyphen just between the letters of the words, then you have to change the solution a little bit.

If you really want to use regex, one option is to use Unicode property escapes, but first check whether your browser/environment supports them (at the time of writing, only IE does not):

const regex = /(\p{L})(?=\p{L})/gu;
const str = 'Pêssego';
const subst = '$1-';
const result = str.replace(regex, subst);

console.log(result); // P-ê-s-s-e-g-o

console.log('Oi, tudo bem?'.replace(regex, subst)); // O-i, t-u-d-o b-e-m?

Here, \p{L} is any letter defined by Unicode (including accented letters and other alphabets). The idea is to match a letter ((\p{L})), provided that it is followed by another letter ((?=\p{L})).

The first letter is in parentheses to form a capture group (and since it is the first pair of parentheses, it is group 1). In the substitution string I use $1 to refer to what was captured in group 1, and put the hyphen right after it.

Note the use of the u flag (which enables "Unicode mode"). Without it, Unicode property escapes do not work: \p{L} is not interpreted as "any letter".

This way, cases with more than one word are handled correctly: note how in the string 'Oi, tudo bem?' hyphens are only placed between letters (but again, if the intention is to hyphenate between all characters, I would use split/join because it is, in my opinion, simpler).
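
As a quick check of why the u flag matters for \p{L}, here is a small sketch comparing the same pattern with and without it. The variable names are only for illustration, and the behavior without the flag relies on the legacy (non-Unicode) regex syntax that browsers and Node.js accept.

```javascript
const withFlag = /\p{L}/u;
const withoutFlag = /\p{L}/;

console.log(withFlag.test("\u00EA"));    // true: ê is a Unicode letter
console.log(withoutFlag.test("\u00EA")); // false
// Without u, the pattern is read as the literal text "p{L}".
console.log(withoutFlag.test("p{L}"));   // true
```

So forgetting the flag does not raise an error; the regex just silently stops matching letters.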


Another option is to use lookbehind:

const regex = /(?<=\p{L})(?=\p{L})/gu;
const str = 'Pêssego';
const subst = '-';
const result = str.replace(regex, subst);

console.log(result); // P-ê-s-s-e-g-o
console.log('Oi, tudo bem?'.replace(regex, subst)); // O-i, t-u-d-o b-e-m?

This way, I match the positions in the string that have a letter before (the lookbehind (?<=\p{L})) and a letter after (the lookahead (?=\p{L})), and insert the hyphen at those positions. In essence, it is a way to simulate \B (but only for the case where both are letters; remember that \B also matches where both are non-alphanumeric).


If you want to limit to Portuguese characters only, you can also use something like:

const regex = /([a-záàâãéèêíïóôõöúç])(?=[a-záàâãéèêíïóôõöúç])/gi;

Or any of the other listed options in this question.
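
For completeness, a short sketch of the Portuguese-only character class in use. The test strings are just examples; the i flag is what lets the lowercase class also cover uppercase accented letters.

```javascript
// Same character class as above, applied with "$1-" as the replacement.
const ptRegex = /([a-záàâãéèêíïóôõöúç])(?=[a-záàâãéèêíïóôõöúç])/gi;

console.log("Pêssego".replace(ptRegex, "$1-"));       // P-ê-s-s-e-g-o
console.log("AÇÃO".replace(ptRegex, "$1-"));          // A-Ç-Ã-O
console.log("Oi, tudo bem?".replace(ptRegex, "$1-")); // O-i, t-u-d-o b-e-m?
```

Letters outside the class (for example ñ or ü) are simply treated as non-letters, so no hyphen is placed next to them.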


Finally, it is worth emphasizing one more difference. Even if \B worked for accented characters, the results would still differ when there are two or more consecutive non-alphanumeric characters:

const str = 'abc...';
const regex1 = /(\p{L})(?=\p{L})/gu;
const regex2 = /\B(?=.)/g;
console.log(str.replace(regex1, "$1-")); // a-b-c...
console.log(str.replace(regex2, "-"));   // a-b-c.-.-.

In the first case the regex only puts the hyphen between two letters (because I explicitly check for letters with \p{L}), while \B matches positions in the string where the characters before and after are of the same "type" (both alphanumeric or both non-alphanumeric). So it also inserts the hyphen between two . characters, and this changes the final result.
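
If what you actually want is the Python behavior in JavaScript (hyphens between two "word" characters and also between two "non-word" characters), one possible sketch is to emulate a Unicode-aware \B with lookbehind and lookahead. This is an approximation: it assumes that "letters, numbers and underscore" ([\p{L}\p{N}_]) is close enough to Python's \w, and the names below are made up for the example.

```javascript
// Approximation of a Unicode-aware \w (an assumption of this sketch).
const WORD = "[\\p{L}\\p{N}_]";
const NON_WORD = "[^\\p{L}\\p{N}_]";

// Position between two word characters, or between two non-word characters.
const unicodeNonBoundary = new RegExp(
  `(?<=${WORD})(?=${WORD})|(?<=${NON_WORD})(?=${NON_WORD})`,
  "gu"
);

// Hypothetical helper name.
const hyphenateLikePython = (str) => str.replace(unicodeNonBoundary, "-");

console.log(hyphenateLikePython("Pêssego"));       // P-ê-s-s-e-g-o
console.log(hyphenateLikePython("abc..."));        // a-b-c.-.-.
console.log(hyphenateLikePython("Oi, tudo bem?")); // O-i,- t-u-d-o b-e-m?
```

Because both alternatives require a character before and after the position, the start and end of the string never get a hyphen.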


As for changing the field in real time, take a look here and here.
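
As a rough sketch of the real-time idea with split/join instead of regex: the formatting can live in a small pure function that first removes the spaces and hyphens already in the field and then re-inserts the hyphens. The function name and the input id are hypothetical, and the event wiring is left as a comment because it only runs in a browser.

```javascript
// Hypothetical helper: rebuilds the hyphenated value from whatever is in the field.
function formatHyphenated(value) {
  // Drop spaces and hyphens already present, then hyphenate the remaining characters.
  const raw = value.replace(/[\s-]/g, "");
  return [...raw].join("-");
}

console.log(formatHyphenated("P-ê-ss"));  // P-ê-s-s
console.log(formatHyphenated("P-ê-s-s")); // P-ê-s-s (already formatted, unchanged)

// Browser-only wiring, assuming an <input id="word"> exists (hypothetical id):
// document.querySelector("#word").addEventListener("input", (event) => {
//   event.target.value = formatHyphenated(event.target.value);
// });
```

One limitation of this approach is that the user cannot type a literal hyphen, since every hyphen is treated as a separator; rewriting the value on each keystroke may also move the cursor to the end of the field.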

  • Wow, I had no idea there was such a discrepancy between the two. Thank you very much, it was very helpful. And about real time, I was able to implement with both and the best option was with regex. If you can recommend a good book or videos to learn regex I would be very grateful!

  • 1

    @Lordepina For regex, two sites I like are this one and this one. As for books, I recommend this one and this one
