What is the best way to get values within multiple tags in multiple HTML files

I have several HTML pages like this one:

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
  "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">

<html xmlns="http://www.w3.org/1999/xhtml" xmlns:epub="http://www.idpf.org/2007/ops">
<head>
    <title>TEXTO</title>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
  <link href="../Styles/stylesheet.css" rel="stylesheet" type="text/css"/>
<link href="../Styles/page_styles1.css" rel="stylesheet" type="text/css"/>
</head>
  <body id="a24" xml:lang="pt-BR" class="calibre">
        <div class="quadro-de-texto-b-sico">
            <h2 class="tit3" id="calibre_pb_0"><img alt="20" src="../Images/00025.jpeg" class="calibre7"/></h2>
            <p class="miolo-sem-ent"><em class="italico">Procure onde dói mais.</em></p>
            <p class="miolo">Bryce se esquivara de dizer a Athalar como o conselho da Rainha Víbora havia sido certeiro. Já tinha dado a ele sua lista de suspeitos... mas o anjo não havia perguntado sobre sua outra exigência.</p>
            <p class="miolo">Então, eis o que decidira fazer: compilar uma lista com cada um dos movimentos de Danika uma semana antes de sua morte. No entanto, no momento em que terminava de abrir a galeria para o dia de trabalho, no momento em que descia até a biblioteca para fazer a lista... a náusea a golpeara.</p>
            <p class="miolo">Então ela ligou o laptop e começou a esmiuçar os e-mails trocados com Maximus Tertian, seis semanas antes. Talvez encontrasse alguma ligação ali... ou, pelo menos, uma pista dos planos do vampiro para aquela noite.</p>
            <p class="miolo">A cada mensagem profissional e insípida que relia, porém, as lembranças dos últimos dias de Danika raspavam a porta vedada de sua mente. Sibilavam e sussurravam como espectros ameaçadores, e Bryce tentava ignorá-las, procurando se concentrar nos e-mails de Tertian, mas...</p>
        </div>
</body>
</html>

I would like to know the best way to get the values inside HTML tags.

Example:

<title>TEXTO</title>

I just want to extract TEXTO.

<p class="miolo-sem-ent"><em class="italico">Procure onde dói mais.</em></p>

I just want to extract Procure onde dói mais.

The same process has to be repeated for another 10 pages with the same structure.

So far I have done something like this:

const fs = require('fs');

fs.readFile('../output/part0000.html', 'utf-8', function (err, data) {
  if(err) throw err;
  const cutTab = data.replace(/(\r\n|\n|\r)/gm, "");
  const clearText = cutTab.replace(/<p( [a-zA-Z]+="[a-zA-Z]+-[a-zA-Z]+-[a-zA-Z]+">)|<p class="[a-zA-Z]+">|<em class="[a-zA-Z]+">|<\/em>|-|\s+/g, ' ');
  const clearSpace = clearText.replace(/\s+/g, ' ')
  const endLine= cutTab.replace(/(<\/p>)/g, "</p>#@");

  const cutWord = clearSpace.split('#');
  
  console.log(clearSpace);
});

But I think that doing it this way will be a lot of work.

Is there a more practical way? Should I convert the files to JSON, XML or TXT?

1 answer


As explained here (and here, here, and mainly here), do not use regex to read, manipulate, or parse HTML.

The ideal is to use a library made specifically for HTML/XML. I will give an example with jsdom, but you can look for another one if you prefer. Just to give you an idea of how it would look:

const jsdom = require('jsdom');
const { JSDOM } = jsdom;
const fs = require('fs');

// read the HTML file
let html_file = fs.readFileSync('../output/part0000.html', 'utf-8');
// get the document from the file contents
const document = new JSDOM(html_file).window.document;

// From the document, you can easily search for tags
// E.g. looking for all "p" or "title" tags
for (const element of document.querySelectorAll('p, title')) {
    console.log(element.textContent); // get the text of the tag
}

That is, with jsdom you can obtain the document for the HTML, and from it you can use CSS selectors to search for the elements you need.

This is interesting not only because it is simpler than regex, but also because it handles several cases that regex does not. For example, if you have a commented-out tag:

<!--
<p>I am inside a comment</p>
-->

Your regex would also pick up this p, but it should not, because it is commented out. The jsdom code above already ignores comments correctly. Being able to search with selectors also makes the search much easier: you can search by class name, id and other attributes, and use all the other possibilities that selectors offer. Doing the equivalent with regex is much more complicated.

This is just one example, of course. Look at the links at the beginning for more cases where regex fails. It is even possible to write one or more regular expressions that deal with these cases, but it is so complicated that it is not worth it: a dedicated lib like jsdom already handles them for you.
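To make the comment problem concrete, here is a small sketch using only built-in JavaScript. The pattern `naivePattern` is a simplified, hypothetical stand-in for the kind of regex one might write to grab the contents of p tags; it is not the exact regex from the question.

```javascript
// Sample input: one real paragraph and one commented-out paragraph.
const html = [
  '<p>visible text</p>',
  '<!--',
  '<p>I am inside a comment</p>',
  '-->',
].join('\n');

// Naive pattern that captures whatever sits between <p ...> and </p>.
const naivePattern = /<p[^>]*>(.*?)<\/p>/g;
const matches = Array.from(html.matchAll(naivePattern), (m) => m[1]);

// The regex has no notion of comments, so both paragraphs are captured.
console.log(matches);
```

The regex only sees characters, so it cannot tell that the second p is inside a comment. A parser builds the document tree first, and there the comment is a separate node that `querySelectorAll('p')` does not return.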
