Check whether the string is present and insert the corresponding tags

I would like to take the content of the site, extract only the text and insert my own tags. But in the code I wrote, once it finds the text "Art" it does not leave the if: only the first items get the li tag, and all the rest get the ul tag.

Could someone help me?


    # Use the cURL extension to download the page
    $url = "www.planalto.gov.br/ccivil_03/constituicao/constituicaocompilado.htm";
    $ch = curl_init();
    $timeout = 5;
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
    $html = curl_exec($ch);
    curl_close($ch);

    # Create a DOM parser object
    $dom = new DOMDocument();

    # Parse the downloaded HTML.
    # The @ before the method call suppresses any warnings that
    # loadHTML might throw because of invalid HTML in the page.
    @$dom->loadHTML($html);




    # Iterate over all the <font> tags
    foreach ($dom->getElementsByTagName('font') as $link) {

        $mystring = $link->nodeValue;
        $findme   = 'Art';
        $pos = strpos($mystring, $findme);

        if ($pos === false) {

            echo "<li>";
            echo $link->nodeValue;
            echo "</li>";

        } else {

            echo "</ul>";
            echo "<ul id='' class='artigo'>";
            echo "<li>";
            echo $link->nodeValue;
            echo "</li>";

        }
    }

so that the end result looks like this:

    <ul id="titulo1" class="titulo">
        <h3>TÍTULO I</h3>
        <p>Dos Princípios Fundamentais</p>
    </ul>
    <ul id="titulo1_artigo1" class="artigo">
        <li>
            <ul class="caput">
                <li>
                    Art. 1º A República ... tem como fundamentos:
                </li>
            </ul>
        </li>
        <li>
            <ul class="incisos">
                <li> I - a soberania;</li>
                <li> II - a cidadania</li>
                <li> III - o pluralismo político.</li>
            </ul>
        </li>
        <li>
            <ul class="paragrafos">
                <li>Parágrafo único. Todo o ... desta Constituição.
                </li>
            </ul>
        </li>
    </ul>
    <ul id="titulo1_artigo2" class="artigo">
        <li>
            <ul class="caput">
                <li>
                    Art. 2º São Poderes da União, independentes e harmônicos entre si, o Legislativo, o Executivo e o Judiciário.
                </li>
            </ul>
        </li>
    </ul>

  • It all worked normally for me: li.../li appears right through to the end, and so does ul id='' class='artigo'.

  • For each line, it should check whether the line contains the article marker and apply ul only to articles, but it also applies ul to lines that do not contain the string "Art".

  • But that is because the ul is in the else. I think you should specify more clearly how you want to display the data. For example, do you want to display the title and then the content of the article? Is that it?

  • But wouldn't that be correct? It checks whether each $link contains the word I am looking for. If the search returns false, it wraps the $link in the li tag only; if it finds the word, it executes the else branch and applies the ul only to that $link, doesn't it?

  • No, the else part is an exception. Since there are several <font> tags, the loop can output anything after the if. Tell me exactly what you want to extract, because I believe you can make better use of the DOMDocument API to simplify the work. I just need to understand what you REALLY want to display, and in what order. Edit the question to add that information.

  • The Constitution has chapters, which have sections, which have articles, which have clauses (incisos) and so on, so I thought I would wrap it in a nested ul (I don't know the name in English). The code would then check: if it is title 4, the id is title 4; if it is article 5, it goes in a ul with id Artigo5.

  • For example, I put a template of what I have in mind in the original question.

  • I think I got the idea from your example. I believe it can all be done using only DOMDocument, without needing strpos. Since it is somewhat time-consuming, I will try to post an answer tomorrow afternoon, friend, if no one has replied by then.

  • Got it, William. Thanks, man.

  • Good morning Ale, as you can see I have provided an answer with an example. If anything is unclear, feel free to ask.

1 answer


Testing your code more carefully, I noticed that the text repeats a few times. This happens because getElementsByTagName returns both the parent element and its child elements, and since the loop prints nodeValue for each of them, the text is always repeated. I thought about using XPath, but the underlying problem is that this particular HTML document has no block-level division for each piece of content; it simply relies on line breaks.

It may be possible to use XPath or something similar, but it looks quite laborious.
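The repetition described above can be reproduced with a small fragment. This is only a sketch: the markup and the helper name collectFontText are made up for illustration, and the real page is far messier.

```php
<?php
// Hypothetical helper: returns the nodeValue of every <font> element,
// in document order, exactly as the loop in the question would print them.
function collectFontText(string $html): array
{
    $dom = new DOMDocument();
    $dom->loadHTML($html);

    $values = [];
    foreach ($dom->getElementsByTagName('font') as $font) {
        // nodeValue includes the text of all descendants,
        // so an outer <font> repeats the text of the inner one.
        $values[] = $font->nodeValue;
    }
    return $values;
}

// Made-up fragment with one <font> nested inside another.
$values = collectFontText(
    '<p><font face="Arial"><font size="2">Art. 1</font> text</font></p>'
);
// $values is ['Art. 1 text', 'Art. 1']: "Art. 1" shows up twice.
```

Both elements match the string "Art", which is one reason the else branch in the question fires more often than expected.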

So, thinking in terms of line breaks, I came up with the following: instead of reading the page as DOM, you can read it as text, line by line, and detect where each article starts and ends.

To read line by line, I recommend using tmpfile(), feof() and fgets(). tmpfile() is used to store the page you downloaded.

    // Write the page to a temporary file
    $handle = tmpfile();
    fwrite($handle, $html);
    fseek($handle, 0);

    // Release the downloaded page; from here on we read from the file
    $html = null;

    $initiate = false; // true once at least one article has been found
    $inTitle = false;  // true while we are inside an article heading

    // Removes complete tags, plus tags cut off at the start or end of the line
    function removeTags($data) {
        $data = trim($data);
        $data = preg_replace('/[<][^>]+[>]|[<][^<>]+$|^[^<>]+[>]/', '', $data);
        return trim($data);
    }

    // Check the file line by line
    while (false === feof($handle)) {
        $buffer = fgets($handle); // Read the line

        // If the line is empty, skip to the next one
        if (trim($buffer) === '') {
            continue;
        }

        // Detect where an article starts
        $findme = strpos($buffer, '>Art.') !== false;

        // Detect a "possible" end of the article or heading
        $endLine = stripos($buffer, '</p>') !== false;

        if ($findme) {

            // If at least one article has already been printed, the previous article's items are finished
            if ($initiate) {
                echo '<hr>', PHP_EOL, PHP_EOL;
            }

            // Record that at least one article was found
            $initiate = true;

            // Record that we are inside the article heading
            $inTitle = true;
            echo '<h1>', removeTags($buffer);
        } else if ($inTitle && $endLine) {
            // Inside the heading and a possible closing of the heading was detected
            $inTitle = false;
            echo removeTags($buffer), '</h1>', PHP_EOL;
        } else if ($initiate) {
            // Not a heading start or end: print the data
            $data = removeTags($buffer);

            // If the cleaned line is empty, skip to the next one
            if ($data === '') {
                continue;
            }

            echo $data, $inTitle ? '' : ('<br>' . PHP_EOL);
        }
    }

    // Close the temporary file
    fclose($handle);

Note that you can swap tmpfile() for fopen() and save the formatted HTML, so you don't need to redo the processing every time.
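One possible way to save the formatted result is sketched below. The helper cachedHtml, the $cacheFile path and the $build callback are all hypothetical names used only for illustration; the callback stands in for whatever code produces the formatted HTML.

```php
<?php
// Hypothetical helper: returns the saved output if the cache file exists,
// otherwise builds it once, writes it with fopen()/fwrite(), and returns it.
function cachedHtml(string $cacheFile, callable $build): string
{
    if (is_file($cacheFile)) {
        return file_get_contents($cacheFile);
    }

    $formatted = $build();

    $out = fopen($cacheFile, 'w');
    fwrite($out, $formatted);
    fclose($out);

    return $formatted;
}
```

With this shape, the download and the line-by-line parsing would only run on the first request; later requests read the saved file. The sketch does not handle cache expiry or write errors.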

This code is just an example, so it does not do everything you need and some details are still missing. The process is the same, though: use the variables to detect, for example, where each article starts and ends.
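As a rough sketch of that idea, the lines can first be grouped per article and only then printed, so every ul is opened and closed exactly once. The functions groupArticles and renderArticles are hypothetical, they assume the lines were already cleaned of tags (for example with removeTags above), and the ids are just a running counter, not the real article numbers.

```php
<?php
// Hypothetical: groups cleaned text lines into articles.
// Assumption: a line starting with "Art." opens a new article, and every
// following line belongs to it until the next "Art." line.
function groupArticles(array $lines): array
{
    $articles = [];
    $current = null; // index of the article currently being filled

    foreach ($lines as $line) {
        $line = trim($line);
        if ($line === '') {
            continue;
        }
        if (strpos($line, 'Art.') === 0) {
            $articles[] = ['caput' => $line, 'items' => []];
            $current = count($articles) - 1;
        } elseif ($current !== null) {
            $articles[$current]['items'][] = $line;
        }
        // Lines before the first article are ignored in this sketch.
    }
    return $articles;
}

// Hypothetical: prints one <ul class="artigo"> per article.
function renderArticles(array $articles): string
{
    $out = '';
    foreach ($articles as $i => $article) {
        $out .= '<ul id="artigo' . ($i + 1) . '" class="artigo">' . PHP_EOL;
        $out .= '    <li>' . htmlspecialchars($article['caput']) . '</li>' . PHP_EOL;
        foreach ($article['items'] as $item) {
            $out .= '    <li>' . htmlspecialchars($item) . '</li>' . PHP_EOL;
        }
        $out .= '</ul>' . PHP_EOL;
    }
    return $out;
}
```

Separating the grouping from the printing avoids the situation in the question, where a closing ul is echoed before any ul has been opened. Splitting each article further into caput, incisos and paragrafos would need extra rules that this sketch does not attempt.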

  • Good afternoon William, I saw your answer, thank you very much. I am studying the code you sent; there is still a lot to learn, haha. What do you think of simplehtmldom (http://simplehtmldom.sourceforge.net/)? Do you think it would help? And if you could answer one more question, I would be very grateful: why does the official Constitution site have that HTML formatting? Thanks again, William.

  • So, @Alêmoraes, I tried it. You might get it to work, but as I said, it is quite laborious, just like XPath, at least with the HTML you are using.

  • William, I'm trying to understand what you sent me, but it's still a little advanced for me. I'm studying the functions it uses, thank you! Using Sublime Text I was able to get a quick result, and I posted it in this question. See what you think: http://answall.com/questions/59374/attributr-o-id-de-acordo-com-o-conte%C3%Bado-do-texto

  • Good evening William Nascimento, see what you think of this solution I found: http://answall.com/questions/59374/attributr-o-id-de-acordo-o-conte%C3%Bado-do-texto

  • @Alêmoraes Honestly, friend, I find it very laborious to assign IDs for something that could be simpler.

  • But to have control over each article, wouldn't it have to be done with IDs?

  • But with that example, the one assigning the IDs is "you". That is, your code will need to assign the IDs and then capture them, and even so you will have difficulty capturing the sub-topics.

  • Guilherme, are you commenting on the old solution (the one on this page) or on the new solution, which is at http://answall.com/questions/59516/id-de-acordo-com-o-conte%C3%Bado-da-string? I agree that the ID is still only an increment. I am studying your code in order to implement it, but in your code only the heading of the article is wrapped in an H1, and the rest of the article is not inside the article. ...
