How to remove characters q, ç and g from the HTML <u> tag in Regex?

Asked

Viewed 187 times

4

I am converting HTML to doc, and I came across a very specific problem.

Words that contain the letters q, g and ç when underlined, are getting an unwanted result.

I thought I’d settle this way:

<u>Administração</u>
<!-- Esperado -->
<u>Administra</u>ç<u>ão</u>

It is possible and how to do with regex?

I’m open to suggestions.

  • The solution can be written in any language
  • I am converting HTML -> doc with Libre Office
  • HTML is limited to not using styles/css
  • 1

    The goal is to have the words underlined to the halves ? This seems a little strange. Wouldn’t it be better to remove the underlining altogether ?

  • Removing the underlining entirely is not an alternative. The objective is the same that was described, removing the underlining of the letters q, g and ç the reading is more pleasant, because the risk of underlining is cutting the "tails/feet" of these letters. I believe that one of the answers given will already solve, with time today I test.

  • One day (when all browsers implement support full) it will be possible to use that

  • 1

    Consider marking @hkotsubo’s reply as correct. From the comments, all doubts have been resolved and it seems to me to be quite complete.

2 answers

4


Although it is possible to solve with regex, it may be easier to manipulate HTML using some specific API for this. As it was said that any language serves, one option is to use Javascript:

let elements = document.querySelectorAll('u');
for (let u of elements) {
    let parent = u.parentNode;
    for (let child of u.childNodes) {
        if (child.nodeType == Node.TEXT_NODE) {
            child.nodeValue.split(/([çqg])/g).forEach(s => {
                if (/^[çqg]$/.test(s)) {
                    parent.insertBefore(document.createTextNode(s), u);
                } else {
                    let novo = document.createElement('u');
                    novo.appendChild(document.createTextNode(s));
                    parent.insertBefore(novo, u);
                }
            });
        } else {
            // não é texto, preserva do jeito que está (dentro de outro "u")
            let novo = document.createElement('u');
            novo.appendChild(child.cloneNode(true));
            parent.insertBefore(novo, u);
        }
    }
    parent.removeChild(u);
    console.log(parent.outerHTML);
}
<p><u>Administração</u> teste</p>
<p>Outro <u>teste que agora etc...</u> lorem ipsum</p>
<p>Outro <u>teste agora <b>que</b> agora vai</u></p>

First I get all the elements <u>, and for each one I go through the elements children, looking for some that is of the type TEXT_NODE. If it is, I’ll make one split with [çqg] (any of these letters), and I place this section of the regex in parentheses, so the letters are also returned in the final result.

Then I go through the result of split, verifying two cases:

  • if it’s one of the letters ç, q or g, I insert it as a text Node
  • otherwise, I’ll create another u and insert the text

Finally, I insert all these elements just before the <u> original (to maintain the relative position if there are more sibling elements in the DOM), and in the end I remove the <u> original, keeping the new nodes created.

There is also a else to deal with cases that are not text nodes (for when there are other tags inside the <u>, and in this case I just re-enter the same element, without modification - but putting it inside another u, so that you don’t lose formatting).

I also printed out the outerHTML just for you to see the final HTML, because in the rendering of the browser - at least for me - it was not very clear which letters ç, g and q were left out of the tags u.

Also note that in the third case, the text between tags <b> was not modified. I did so because it was not clear what to do in this case. In the original tag we had:

<p>Outro <u>teste agora <b>que</b> agora vai</u></p>

Then the result should be which of the options below?

<p>Outro <u>teste a</u>g<u>ora </u><b>q</b><u><b>ue</b></u><u> a</u>g<u>ora vai</u></p>

<p>Outro <u>teste a</u>g<u>ora </u><u><b>que</b></u><u> a</u>g<u>ora vai</u></p>

I chose the second, because it is simpler (but of course it is possible to adapt the above code to deal with these more complicated cases).


Although it is possible a single regex to replace everything, this is not the most suitable tool to handle HTML. The vast majority of languages have some lib (native or not) to handle HTML/XML, and with them you can handle these more complicated cases (like having tags inside other, comments, fields CDATA, etc), which with regex is much more difficult to treat.

Of course, for simpler cases (a tag <u> with only text inside it, without any other tag), the solution proposed by Andrei works very well. Another alternative to his regex is (in PHP, which is what he used):

$texto = "<u>Administração</u>";
echo preg_replace('/[qpç]/u', '</u>$0<u>', $texto);

Or in Javascript:

console.log('<u>Administração</u>'.replace(/[qpç]/g, '</u>$&<u>'));

Again, this regex works well when there are no other tags inside the <u>. But if you have another tag inside the <u> (as in the other example above), there doesn’t work that well anymore. A string like that:

<u>teste agora <b>que</b></u>

Becomes:

<u>teste a</u>g<u>ora <b></u>q<u>ue</b></u>

Look at the stretch <b></u>q<u>ue</b>: closes the u without closing the b, then open another u and closes the b. It messed everything up. And the regex to treat these cases would be much more complex...

It is up to you to assess whether a simpler regex already covers all your cases, and - as important as - whether it does not generate invalid cases. Remember that a regex handles text, regardless of semantics and/or structure (then put <u> anywhere in the text is indifferent to it), but an HTML has a well-defined structure and you can’t add anything anywhere. Therefore a Specific API is usually better than simple text manipulation (except in simpler cases, as already mentioned).


Based on what was said in his comment (you do not want the underlined "cut" letters), another option - if you can use CSS - is to use the property text-decoration-skip-ink.

u.skip {
  text-decoration-skip-ink: auto;
}

u.dontskip {
  text-decoration-skip-ink: none;
}
<p><u class="skip">Com skip Administração que agora etc...</u></p>

<p><u class="dontskip">Sem skip Administração que agora etc...</u></p>

Unfortunately, it is not yet all browsers that support this feature (at this time, only Chrome and Opera support it). If the browser you use does not support, follow an image of how the above example looks:

inserir a descrição da imagem aqui

  • For the result obtained with the function "execute" the code here of the stack, I believe that the solution will be this, as soon as possible I will test in my code.

  • @Skywalker I made a change to the first code, I saw that I had a failed case.

2

I believe you want this:

$texto = "<u>Administração</u>";

echo preg_replace('/([qpç]+)/u', '</u>$1<u>', $texto);

That one REGEX will work if you have the characters mentioned or not.

  • 2

    In case the options are only one letter, it is simpler to use [qpç]. Anyway, I think it’s worth mentioning that this only works if the tag u has no other tag inside it - because if it has, the end result will be an invalid HTML: https://regex101.com/r/IolPTb/1/ (of course this is not mentioned by AP, but you will know...) :-)

  • @hkotsubo yes! He didn’t mention it, so I simplified it. In the case when I used the [qpç] in php it returned a wrong encoding. I don’t know why yet. So I put it this way.

  • Ah, I think the problem is ç, that within [] may give problem depending on the encoding. But using the flag u vc forces regex to use UTF-8: preg_replace('/([qpç]+)/iu', '</u>$1<u>', $texto); - see the difference: with the flag and without the flag. And +1 late - yesterday I commented, went to sleep and forgot to vote :-)

  • 1

    @hkotsubo is right, my mistake in leaving no more comments, but this answer adds the question. Thank you.

  • Another detail is that, from what I understand of this comment, He only wants the lower case letters, so he better remove the flag i of regex

  • 1

    @hkotsubo worth! I will edit!

  • 1

    @hkotsubo didn’t know about the flag U, I was trying to do an encoding for utf8... kkk Valeu!

  • 1

    Just remembering that the flag is u minuscule. The U there is also a capital letter, but it does something else: https://www.php.net/manual/en/reference.pcre.pattern.modifiers.php

Show 3 more comments

Browser other questions tagged

You are not signed in. Login or sign up in order to post.