Although it is possible to solve with regex, it may be easier to manipulate HTML using some specific API for this. As it was said that any language serves, one option is to use Javascript:
let elements = document.querySelectorAll('u');
for (let u of elements) {
let parent = u.parentNode;
for (let child of u.childNodes) {
if (child.nodeType == Node.TEXT_NODE) {
child.nodeValue.split(/([çqg])/g).forEach(s => {
if (/^[çqg]$/.test(s)) {
parent.insertBefore(document.createTextNode(s), u);
} else {
let novo = document.createElement('u');
novo.appendChild(document.createTextNode(s));
parent.insertBefore(novo, u);
}
});
} else {
// não é texto, preserva do jeito que está (dentro de outro "u")
let novo = document.createElement('u');
novo.appendChild(child.cloneNode(true));
parent.insertBefore(novo, u);
}
}
parent.removeChild(u);
console.log(parent.outerHTML);
}
<p><u>Administração</u> teste</p>
<p>Outro <u>teste que agora etc...</u> lorem ipsum</p>
<p>Outro <u>teste agora <b>que</b> agora vai</u></p>
First I get all the elements <u>
, and for each one I go through the elements children, looking for some that is of the type TEXT_NODE
. If it is, I’ll make one split
with [çqg]
(any of these letters), and I place this section of the regex in parentheses, so the letters are also returned in the final result.
Then I go through the result of split
, verifying two cases:
- if it’s one of the letters
ç
, q
or g
, I insert it as a text Node
- otherwise, I’ll create another
u
and insert the text
Finally, I insert all these elements just before the <u>
original (to maintain the relative position if there are more sibling elements in the DOM), and in the end I remove the <u>
original, keeping the new nodes created.
There is also a else
to deal with cases that are not text nodes (for when there are other tags inside the <u>
, and in this case I just re-enter the same element, without modification - but putting it inside another u
, so that you don’t lose formatting).
I also printed out the outerHTML
just for you to see the final HTML, because in the rendering of the browser - at least for me - it was not very clear which letters ç
, g
and q
were left out of the tags u
.
Also note that in the third case, the text between tags <b>
was not modified. I did so because it was not clear what to do in this case. In the original tag we had:
<p>Outro <u>teste agora <b>que</b> agora vai</u></p>
Then the result should be which of the options below?
<p>Outro <u>teste a</u>g<u>ora </u><b>q</b><u><b>ue</b></u><u> a</u>g<u>ora vai</u></p>
<p>Outro <u>teste a</u>g<u>ora </u><u><b>que</b></u><u> a</u>g<u>ora vai</u></p>
I chose the second, because it is simpler (but of course it is possible to adapt the above code to deal with these more complicated cases).
Although it is possible a single regex to replace everything, this is not the most suitable tool to handle HTML. The vast majority of languages have some lib (native or not) to handle HTML/XML, and with them you can handle these more complicated cases (like having tags inside other, comments, fields CDATA
, etc), which with regex is much more difficult to treat.
Of course, for simpler cases (a tag <u>
with only text inside it, without any other tag), the solution proposed by Andrei works very well. Another alternative to his regex is (in PHP, which is what he used):
$texto = "<u>Administração</u>";
echo preg_replace('/[qpç]/u', '</u>$0<u>', $texto);
Or in Javascript:
console.log('<u>Administração</u>'.replace(/[qpç]/g, '</u>$&<u>'));
Again, this regex works well when there are no other tags inside the <u>
. But if you have another tag inside the <u>
(as in the other example above), there doesn’t work that well anymore. A string like that:
<u>teste agora <b>que</b></u>
Becomes:
<u>teste a</u>g<u>ora <b></u>q<u>ue</b></u>
Look at the stretch <b></u>q<u>ue</b>
: closes the u
without closing the b
, then open another u
and closes the b
. It messed everything up. And the regex to treat these cases would be much more complex...
It is up to you to assess whether a simpler regex already covers all your cases, and - as important as - whether it does not generate invalid cases. Remember that a regex handles text, regardless of semantics and/or structure (then put <u>
anywhere in the text is indifferent to it), but an HTML has a well-defined structure and you can’t add anything anywhere. Therefore a
Specific API is usually better than simple text manipulation (except in simpler cases, as already mentioned).
Based on what was said in his comment (you do not want the underlined "cut" letters), another option - if you can use CSS - is to use the property text-decoration-skip-ink
.
u.skip {
text-decoration-skip-ink: auto;
}
u.dontskip {
text-decoration-skip-ink: none;
}
<p><u class="skip">Com skip Administração que agora etc...</u></p>
<p><u class="dontskip">Sem skip Administração que agora etc...</u></p>
Unfortunately, it is not yet all browsers that support this feature (at this time, only Chrome and Opera support it). If the browser you use does not support, follow an image of how the above example looks:
The goal is to have the words underlined to the halves ? This seems a little strange. Wouldn’t it be better to remove the underlining altogether ?
– Isac
Removing the underlining entirely is not an alternative. The objective is the same that was described, removing the underlining of the letters q, g and ç the reading is more pleasant, because the risk of underlining is cutting the "tails/feet" of these letters. I believe that one of the answers given will already solve, with time today I test.
– Skywalker
One day (when all browsers implement support full) it will be possible to use that
– hkotsubo
Consider marking @hkotsubo’s reply as correct. From the comments, all doubts have been resolved and it seems to me to be quite complete.
– Andrei Coelho