Regex: how to capture the groups separately for each occurrence

3

I’m trying to add attributes to an <a> tag produced by a Markdown parser (markdown => html).

In my Markdown document I add parentheses containing the attributes I want right after declaring the link, for example:

[Cool Text](https://hiperlynk "title")(class="ext-link-icon" data-super="..." foo="bar")

The parser does its job properly and replaces only the markup it recognizes:

<a href="https://hiperlynk" title="title">Cool Text</a>(class="ext-link-icon" data-super="..." foo="bar")

From this point on, I have to find what I added in parentheses at the end of the Markdown markup and add it inside the opening <a> tag. I’m using the following RegEx: /(<a.+<\/a>)\((.+=".+" ?)+\)/g

The code below is what I have for now:

let regex = /(<a.+<\/a>)\((.+=".+" ?)+\)/g
let str = '<a href="https://hiperlynk" title="title">Cool Text</a>(class="ext-link-icon" data-super="..." foo="bar")'.replace(regex, (match, $1, $2) => {
    if ( !$1 && !$2 ) {
        let url = match.match(/"(.*?)"/)[1]
        // check whether it is a local link or points to the same hostname
        if ( url.includes(window.location.hostname) || url[0] == '/' || url[0] == '.' || url[0] == '#' ) {
            // if it is a local link, return it unchanged
            return match
        }
        // here it is assumed not to be a local link, so attributes are added
        let allHrefContent = match.match(/^<a (.*?)>/)[1];
        if ( !allHrefContent.includes('target="') ) {
            allHrefContent += ' target="about:blank"'
        }
        allHrefContent += ' rel="noopener noreferrer"'
        return `<a ${allHrefContent}>${match.match(/>(.*?)</)[1]}</a>`
    } else {
        // here the second group is everything that was added in parentheses after the link in the `markdown`
        if ( /^(rel=")/.test($2) ) {
            let rel = $2.replace(/rel="|"/g, '');
            if ( !rel.includes('noopener') ) {
                rel += ' noopener'
            }
            if ( !rel.includes('noreferrer') ) {
                rel += ' noreferrer'
            }
            return $1.replace('">', `" target="about:blank" rel="${rel}" ${$2}>`)
        } else {
            return $1.replace('">', `" target="about:blank" rel="noopener noreferrer" ${$2}>`)
        }
    }
})

document.body.innerHTML = str
console.log(str)

.ext-link-icon {
  background: url("data:image/svg+xml;charset=UTF-8,%3Csvg xmlns='http://www.w3.org/2000/svg' width='24' height='24' viewBox='0 0 24 24' fill='rgb(51, 103, 214)'%3E%3Cpath d='M19 19H5V5h7V3H5c-1.11 0-2 .9-2 2v14c0 1.1.89 2 2 2h14c1.1 0 2-.9 2-2v-7h-2v7zM14 3v2h3.59l-9.83 9.83 1.41 1.41L19 6.41V10h2V3h-7z'/%3E%3C/svg%3E") right/12px no-repeat;
  padding-right: 0.875em;
}

It works well with just one link, but it breaks if there is more than one, and I am not able to work out the logic to group the occurrences.

Example with more than one link:

let regex = /(<a.+<\/a>)\((.+=".+" ?)+\)/g
let str = '<a href="../">Voltar</a>(class="patinho-feio") qualquer coisa aqui <a href="https://hiperlynk" title="title">Cool Text</a>(class="ext-link-icon" data-super="..." foo="bar")'.replace(regex, (match, $1, $2) => {
    if ( !$1 && !$2 ) {
        let url = match.match(/"(.*?)"/)[1]
        // check whether it is a local link or points to the same hostname
        if ( url.includes(window.location.hostname) || url[0] == '/' || url[0] == '.' || url[0] == '#' ) {
            // if it is a local link, return it unchanged
            return match
        }
        // here it is assumed not to be a local link, so attributes are added
        let allHrefContent = match.match(/^<a (.*?)>/)[1];
        if ( !allHrefContent.includes('target="') ) {
            allHrefContent += ' target="about:blank"'
        }
        allHrefContent += ' rel="noopener noreferrer"'
        return `<a ${allHrefContent}>${match.match(/>(.*?)</)[1]}</a>`
    } else {
        // here the second group is everything that was added in parentheses after the link in the `markdown`
        if ( /^(rel=")/.test($2) ) {
            let rel = $2.replace(/rel="|"/g, '');
            if ( !rel.includes('noopener') ) {
                rel += ' noopener'
            }
            if ( !rel.includes('noreferrer') ) {
                rel += ' noreferrer'
            }
            return $1.replace('">', `" target="about:blank" rel="${rel}" ${$2}>`)
        } else {
            return $1.replace('">', `" target="about:blank" rel="noopener noreferrer" ${$2}>`)
        }
    }
})

document.body.innerHTML = str
console.log(str)

.ext-link-icon {
  background: url("data:image/svg+xml;charset=UTF-8,%3Csvg xmlns='http://www.w3.org/2000/svg' width='24' height='24' viewBox='0 0 24 24' fill='rgb(51, 103, 214)'%3E%3Cpath d='M19 19H5V5h7V3H5c-1.11 0-2 .9-2 2v14c0 1.1.89 2 2 2h14c1.1 0 2-.9 2-2v-7h-2v7zM14 3v2h3.59l-9.83 9.83 1.41 1.41L19 6.41V10h2V3h-7z'/%3E%3C/svg%3E") right/12px no-repeat;
  padding-right: 0.875em;
}


I confess that RegEx is not my thing... so the short question is: how can I capture the groups for two or more occurrences that follow this pattern?

<a>text</a>(class="foo") qualquer coisa aqui <a>text</a>(class="bar")

So that I can arrive at the expected result:

<a class="foo">text</a> qualquer coisa aqui <a class="bar">text</a>

Thanks in advance for any help that leads me to understand the problem.

  • 1

    Using the DOM, nextSibling will give you the text that follows the specified elements, and with removeChild you can remove it; once you have the value of the #text node, you can easily convert it into attributes for the previous element. In short, if I understand your question, regex seems totally unnecessary for this case.

  • @Guilhermenascimento I didn’t quite understand your suggestion. I’m manipulating a string before inserting it into the DOM... why would I put it in the DOM only to remove it afterwards? Sorry if I got it wrong.

  • 1

    You want to convert the texts such as (class="foo") into attributes of the preceding elements, right? So that this text ends up as the class attribute added to the element, or did I get it wrong?

  • I want to take what’s in the parentheses (), in this case class="foo", and add it inside the <a> tag before inserting into the DOM... when the links reach the DOM they will already have the attributes. I can reach this result with one link in the string, but not with 2 or more.

  • 1

    Well, it looks exactly like what I said :) ... so you could use domparsed = new DOMParser().parseFromString("your string here", "text/html") to process your text before adding it to the page as DOM, then select all A elements with var links = domparsed.getElementsByTagName('a'), and then loop over links with a for, item by item, checking the nextSibling of each one; with this property you get the value of the text (class="class name")

  • Do not parse HTML with regex. Regular expressions are not a sufficiently sophisticated tool to understand the constructs employed by HTML. See Analyzing Cthulhu-like Html

  • 1

    @Augustovasques I didn’t have time to "optimize" it, but I left something roughly ready: https://answall.com/a/495414/3635 ... I hope I understood the question.


1 answer

2


From what I understand of the question, you have generated:

<a href="https://hiperlynk" title="title">Cool Text</a>(class="ext-link-icon" data-super="..." foo="bar")

You want to pick up the custom attributes that your parser does not handle at first (the second part). The regex /(<a.+<\/a>)\((.+=".+" ?)+\)/g could work in a loop with String.prototype.match(), but besides being problematic as written, it would require the values captured inside the parentheses to be handled separately, which is very laborious.
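To illustrate why the pattern is problematic, here is a minimal sketch using the two-link sample from the question. The greedy `.+` runs from the first `<a` to the last `</a>`, so the whole string becomes a single match. The `perLink` pattern is a hypothetical alternative that is not part of the question; it assumes the link text contains no nested tags and that the attribute values contain no `)` or `>`.

```javascript
// Sample input from the question: two links, each followed by "(attributes)".
const input = '<a>text</a>(class="foo") qualquer coisa aqui <a>text</a>(class="bar")';

// Original pattern: the greedy `.+` spans from the first <a to the last </a>.
const greedy = /(<a.+<\/a>)\((.+=".+" ?)+\)/g;
const greedyMatches = input.match(greedy);
console.log(greedyMatches.length); // 1 (both links swallowed by one match)

// Hypothetical alternative (assumption): negated character classes stop at the
// first ">", "<" and ")" instead of running to the end of the string.
const perLink = /<a\b([^>]*)>([^<]*)<\/a>\(([^)]*)\)/g;
const output = input.replace(
    perLink,
    (match, attrs, text, extra) => `<a${attrs} ${extra}>${text}</a>`
);
console.log(output);
```

Even with a pattern like this, attribute values containing `)` would break the match, which is one reason the DOM-based approach described next is more robust.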

As far as I understand, all of this is still a string and not yet DOM, so the processing can be done with new DOMParser().parseFromString(string, "text/html"). To pick up the text that follows each element, use Node.nextSibling (which, unlike Element.nextElementSibling, returns text nodes as well as elements).

A more practical example:

let preParsedData = `
<a href="https://hiperlynk" title="title">Cool Text</a>(class="ext-link-icon" data-super="..." foo="bar")
foo bar
<a>text</a>(class="foo") qualquer coisa aqui <a>text</a>(class="bar")
foo bar
<a>invalido</a><br>
<a>invalido</a>`;

// Parse the string
let docParsed = new DOMParser().parseFromString(preParsedData, "text/html");

// Create a template used to build the attributes on "fake" elements
let template = docParsed.createElement('template');

// Get all A elements
let anchors = docParsed.getElementsByTagName("a");

// Iterate over all anchor elements
for (let mainElement of anchors) {

    // Get the next node (which can be text, an element, or null)
    let node = mainElement.nextSibling;

    // Skip if it is null, has no text value, or does not start with (
    if (!node || !node.nodeValue || node.nodeValue[0] !== '(') {
        continue;
    }

    // Get the position of the )
    let text = node.nodeValue, lastChar = text.indexOf(')');

    // If ) is not found, the syntax is not valid
    if (lastChar === -1) {
        continue;
    }

    // Remove from the text the part that represented the attributes
    node.nodeValue = text.slice(lastChar + 1);

    // Get the attributes to apply them directly to a "temporary" element
    let strAttrs = text.slice(1, lastChar);

    template.innerHTML = `<div ${strAttrs}></div>`;

    // Get all attributes of the temporary element
    let attrs = template.content.firstChild.attributes;
    
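    // Iterate over a copy: attrs is a live NamedNodeMap, so removing items
    // while iterating it directly would skip every other attribute
    for (let attribute of Array.from(attrs)) {

        // Remove the attribute from the temporary element
        let detached = attrs.removeNamedItem(attribute.name);

        // Apply the removed attribute to the A element
        mainElement.attributes.setNamedItem(detached);
    }
}


// Get the result as a string (if needed)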
let results = docParsed.body.innerHTML;

console.log(results);

This is just a suggestion; I have not looked into micro-optimizations, since the goal is understanding. Besides the methods already mentioned at the beginning of the answer, the following were also used:

  • NamedNodeMap.removeNamedItem() to remove (detach) an attribute
  • NamedNodeMap.setNamedItem() to add an attribute
  • Element.attributes to obtain the element's NamedNodeMap (its list of attributes)

And of course some simple string operations.
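Those string operations can be looked at in isolation. The sketch below is a hypothetical helper (the name `splitTrailingAttrs` is an assumption, not part of the answer's code) that applies the same indexOf-based checks to the text following a link and returns the attribute string and the remaining text, or null when the syntax is not valid.

```javascript
// Hypothetical helper (assumption): splits the text that follows a link into
// the attribute string inside the leading parentheses and the remaining text.
function splitTrailingAttrs(text) {
    // The text must exist and start with "("
    if (!text || text[0] !== '(') {
        return null;
    }

    // Without a closing ")" the syntax is not valid
    const close = text.indexOf(')');
    if (close === -1) {
        return null;
    }

    return {
        attrs: text.slice(1, close),  // what was between the parentheses
        rest: text.slice(close + 1)   // text to keep after the link
    };
}

console.log(splitTrailingAttrs('(class="foo") qualquer coisa aqui '));
console.log(splitTrailingAttrs(' foo bar'));
```

Like the loop in the answer, this stops at the first `)`, so it assumes attribute values do not contain that character.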

  • Thank you William, I believe your answer addresses the question. I followed your comments yesterday and arrived at something a little different (to suit my needs) using DOMParser(). It was my first time using DOMParser(); it seemed easy, and it was good to learn something new. In case you want to take a look, I put what I did with your suggestions in a fiddle. Thanks once again.

  • "As far as I understand all this is a string and not DOM yet". Yes you got it right

  • 1

    Not bad @Lauromoraes, but using spread for simple things and still having to use regex made it a little more complicated than it needed to be. Anyway, it is still good; I’m glad I could help you.
