Help with a PHP loop structure - PARSER - simple_html_dom.php

I’m writing a parser with simple_html_dom.php that pulls all the links from a given page. I can pull the links and store them in an array. Here is the problem:

  • the page displays at most 36 items per page.
  • the number of items increases and decreases sporadically.

Example of the situation:

If the source I am parsing has 133 items, then because of the limit of 36 items per page I have to parse 4 times, changing the page number in the URL each time, until all 133 items have been pulled.
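
As a quick check of that arithmetic: when the total is known, the number of page requests is the total divided by the page size, rounded up. A minimal sketch, where the helper name pagesNeeded is hypothetical and the page size of 36 is the limit stated above:

```php
<?php
// Hypothetical helper: number of page requests needed to cover $totalItems
// when each page shows at most $perPage items.
function pagesNeeded(int $totalItems, int $perPage): int
{
    if ($perPage <= 0) {
        throw new InvalidArgumentException('perPage must be positive');
    }
    return (int) ceil($totalItems / $perPage);
}

echo pagesNeeded(133, 36), "\n"; // 4 (three full pages of 36 plus one page of 25)
```

This only confirms the count of 4 for a known total; since the total changes over time, the loop itself still has to detect the last page.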

What I need:

Pull all 133 items without specifying a static limit for the counter. Because the number of items increases and decreases, the limit has to be dynamic and automatic.

What I’ve already done:

require ("simple_html_dom.php");

// set the script time limit to 0 (no limit)
set_time_limit(0);

// variable that counts the total number of links found
$nlinks = 0;

// string that receives the value from the parser
$string = '';

// array that receives the values from the parser
$toyota = array();

// counters
$cont = 0;
$x = 1;

/****************
    I NEED THE LOOP BELOW (WHILE) TO RUN UNTIL
    THE ARRAY (TOYOTA) IS FILLED WITH THE TOTAL NUMBER OF LINKS FOUND,
    WITHOUT HAVING TO SPECIFY A STATIC LIMIT FOR THE COUNTER...
    THIS NEEDS TO BE DYNAMIC AND AUTOMATIC; IN THE CASE BELOW I USED A STATIC 4
*****************/

// while the counter is less than 4, enter the loop
while($cont < 4){

    // get DOM from URL or file
    $html = file_get_html('http://www.webmotors.com.br/comprar/carros/novos-usados/'
    .'sp-sao-paulo/toyota/?tipoveiculo=carros&tipoanuncio=novos-usados&anunciante=pessoa'
    .'%20f%C3%ADsica&marca=toyota&vehicle1=%7B%22marca%22:%22toyota%22%7D&location=%5B%7B'
    .'%22state%22:%22s%C3%A3o%20paulo%22,%22abbr%22:%22sp%22%7D%5D&precoate=170000&anode'
    .'=2012&kmate=30000&atributos=%C3%9Anico%20dono&p='.$cont."&o=3&qt=36");

    // for each link found...
    foreach($html->find('a') as $e){
        $string = (string) $e->href;

        // just checks whether the link does not have the string "comprar/toyota"
        if(strpos($string, 'comprar/toyota') != 1){
            unset($html);
        }else{
            // checks whether the link has the string "comprar/toyota"
            if(strpos($string, 'comprar/toyota') == 1){
                // turns the string found into an active link
                $link = "<a href='http://www.webmotors.com.br/".$string. "'>".$string. "</a>";

                //echo $link."<br>";

                unset($html);
                $nlinks++;

                // insert the link into the array
                $toyota[$nlinks] = $link;
            }
        }
    }
    $cont++;
}

// get the size of the array
$tam = sizeof($toyota);

// while the counter is less than or equal to the size of the array
while($x <= $tam){
    // print the array element at position x
    echo $toyota[$x]."<br>";
    $x++;
}

// prints "<n> TOYOTA cars were found!"
echo "<br> ".$nlinks." carros da TOYOTA foram encontrados!<br>";

The output:

[image]

Can anyone lend a hand?

1 answer

Do it like this:

$html = true;

while($html){   

// get DOM from URL or file
$html = file_get_html('http://www.webmotors.com.br/comprar/carros/novos-usados/'
.'sp-sao-paulo/toyota/?tipoveiculo=carros&tipoanuncio=novos-usados&anunciante=pessoa'
.'%20f%C3%ADsica&marca=toyota&vehicle1=%7B%22marca%22:%22toyota%22%7D&location=%5B%7B'
.'%22state%22:%22s%C3%A3o%20paulo%22,%22abbr%22:%22sp%22%7D%5D&precoate=170000&anode'
.'=2012&kmate=30000&atributos=%C3%9Anico%20dono&p='.$cont."&o=3&qt=36");

Why? Because when file_get_html cannot fetch the page you want, it returns false.

Note: the application should handle that false value, since using it unchecked generally ends in a fatal error. An example of handling it:

if($html){
    // for each link found...
    foreach($html->find('a') as $e){
    }
}
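
Putting the two snippets above together, one way to make the loop dynamic is to keep requesting pages until a request fails or a page yields no matching links. The sketch below is an illustration under assumptions, not the answerer's exact code: collectAllLinks and the $fetchPage callback are hypothetical names, and the callback stands in for the file_get_html call plus the foreach over $html->find('a'), so the loop logic can be run without network access.

```php
<?php
/**
 * Collect links page by page until a page cannot be fetched or is empty.
 *
 * $fetchPage is a hypothetical callback: it receives the page number and
 * returns an array of link strings, or false when the page could not be
 * loaded (mirroring a failed file_get_html call).
 * $maxPages is a safety cap so the loop always terminates.
 */
function collectAllLinks(callable $fetchPage, int $firstPage = 0, int $maxPages = 1000): array
{
    $links = [];
    for ($page = $firstPage; $page < $firstPage + $maxPages; $page++) {
        $found = $fetchPage($page);
        // Stop on a failed request or on a page with no matching links.
        if ($found === false || count($found) === 0) {
            break;
        }
        foreach ($found as $link) {
            $links[] = $link;
        }
    }
    return $links;
}

// Stand-in data source: 133 fake items served 36 per page.
$items = [];
for ($i = 1; $i <= 133; $i++) {
    $items[] = 'comprar/toyota/item-' . $i;
}
$fakeFetch = function (int $page) use ($items) {
    return array_slice($items, $page * 36, 36);
};

$all = collectAllLinks($fakeFetch);
echo count($all), "\n"; // 133
```

This assumes the site returns no matching links for a page number past the end. If it repeats the last page instead, the stop condition would need to compare each page against the links already collected.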
  • There is just one detail: in the file_get_html call, the end of the URL has the variable "$cont", which is the crux of the question, because it needs to change dynamically to select the page I am checking.

  • You can increment it.

  • So, as you can see in my code, "$cont++" runs right after the "foreach", but the increment is not taking effect, because my check now only pulls the first 36 results.

  • That’s because you’re calling unset($html); do that only after the whole process has completed.

  • Do you know of any way to speed up parsing with simple_html_dom? The checks I’m running take about 55 seconds on average to return results, and sometimes they still give the fatal error; I have to find out why. I hosted it on a server for testing, take a look if you can: http://evolveti.com/paser/

  • I don’t know how to improve the performance, but the fatal error usually happens because file_get_html could not fetch the page you wanted and returned false. But was the question resolved?

  • In a way, yes. I’m using your if($html) tip to try to fix the fatal error, but I couldn’t make while($html) work, so I left while($cont < 40), because even with a high value the host’s PHP server responds in a relatively acceptable time, 55 seconds on average. It is working; now I just need to tune the performance and handle the errors. Thank you!

  • Why didn’t while($html) work? Accept the answer if it was useful; otherwise give me more details =D

  • My counter is not incremented inside while($html), so only the 36 items of the first page are taken, without going through the others. I removed the unset($html) and it still didn’t work...

  • Can’t you increment it inside the while?

  • Yes, but it turns out it only increments once: it goes from 0 to 1 and exits the loop.

  • It exits the loop because you’re calling unset($html); do that only after the process has completed.

  • I put it inside the while($html) and commented out the unset (//unset($html)); look what happens: http://evolveti.com/paser/toyota.php

  • Look: uncomment the unset, and below $cont++ add a control statement that checks whether there are more pages to parse; if there are, set $html = true;
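
The control statement suggested in the last comment could, for instance, treat a page with fewer than 36 matching links as the last one. A minimal sketch under that assumption: hasMorePages is a hypothetical name, 36 is the page size from the qt=36 parameter, and the per-page counts are simulated rather than scraped.

```php
<?php
// Hypothetical check: a page that returned fewer links than the page size
// is assumed to be the last one.
function hasMorePages(int $linksOnPage, int $perPage = 36): bool
{
    return $linksOnPage >= $perPage;
}

// Simulated per-page link counts for 133 items: 36, 36, 36, 25.
$countsPerPage = [36, 36, 36, 25];
$cont = 0;
$nlinks = 0;
$more = true;
while ($more) {
    // Stands in for counting the matching links on page $cont.
    $linksOnPage = $countsPerPage[$cont];
    $nlinks += $linksOnPage;
    // Decide whether another page should be requested.
    $more = hasMorePages($linksOnPage);
    $cont++;
}
echo $cont, " pages, ", $nlinks, " links\n"; // 4 pages, 133 links
```

When the total is an exact multiple of 36, this check triggers one extra request, which then has to return zero matching links for the loop to stop.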
