I’m writing a parser with simple_html_dom.php that pulls all the links from a given page. I can pull the links and store them in an array, but here is the problem:
- the site displays at most 36 items per page.
- the number of items goes up and down unpredictably...
Example of the situation:
If the source I am parsing has 133 items, the limit of 36 items per page means I have to parse 4 times, changing the page number in the URL each time, until all 133 items have been pulled.
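The arithmetic behind those 4 requests can be sketched as below. This is only an illustration: pagesNeeded is a hypothetical helper name, and it only helps when the total is known in advance, which is exactly what is not the case here.

```php
<?php
// Hypothetical helper: number of page requests needed for a known total.
function pagesNeeded(int $totalItems, int $perPage): int
{
    // Round up, because a partially filled last page still needs a request.
    return (int) ceil($totalItems / $perPage);
}

echo pagesNeeded(133, 36), "\n"; // 4 (36 + 36 + 36 + 25 items)
```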
What I need:
Fetch all 133 items without specifying a static limit for the counter. Since the number of items goes up and down, the limit has to be dynamic and automatic.
What I’ve already done:
require ("simple_html_dom.php");
// set the script time limit to 0 (no limit)
set_time_limit(0);
// counts the total number of links found
$nlinks = 0;
// string that receives the value from the parser
$string = '';
// array that receives the values from the parser
$toyota = array();
// counters
$cont = 0;
$x = 1;
/****************
I NEED THE LOOP BELOW (WHILE) TO RUN UNTIL
THE ARRAY (TOYOTA) IS FILLED WITH ALL THE LINKS FOUND,
WITHOUT HAVING TO SPECIFY A STATIC LIMIT FOR THE COUNTER...
THIS NEEDS TO BE DYNAMIC AND AUTOMATIC; IN THE CASE BELOW I HARD-CODED 4
*****************/
// while the counter is less than 4, enter the loop
while($cont < 4){
    // get DOM from URL or file
    $html = file_get_html('http://www.webmotors.com.br/comprar/carros/novos-usados/'
        .'sp-sao-paulo/toyota/?tipoveiculo=carros&tipoanuncio=novos-usados&anunciante=pessoa'
        .'%20f%C3%ADsica&marca=toyota&vehicle1=%7B%22marca%22:%22toyota%22%7D&location=%5B%7B'
        .'%22state%22:%22s%C3%A3o%20paulo%22,%22abbr%22:%22sp%22%7D%5D&precoate=170000&anode'
        .'=2012&kmate=30000&atributos=%C3%9Anico%20dono&p='.$cont."&o=3&qt=36");
    // for each link found...
    foreach($html->find('a') as $e){
        $string = (string) $e->href;
        // the link does not have the string "comprar/toyota" at position 1
        if(strpos($string, 'comprar/toyota') != 1){
            unset($html);
        }else{
            // the link has the string "comprar/toyota" at position 1
            if(strpos($string, 'comprar/toyota') == 1){
                // turn the string found into an active link
                $link = "<a href='http://www.webmotors.com.br/".$string. "'>".$string. "</a>";
                //echo $link."<br>";
                unset($html);
                $nlinks++;
                // insert the link into the array
                $toyota[$nlinks] = $link;
            }
        }
    }
    $cont++;
}
// get the size of the array
$tam = sizeof($toyota);
// while the counter is less than or equal to the size of the array
while($x <= $tam){
    // print the array at position x
    echo $toyota[$x]."<br>";
    $x++;
}
echo "<br> ".$nlinks." carros da TOYOTA foram encontrados!<br>";
The way out:
If anyone can give me a hand, I would appreciate it.
There is just one detail: in the file_get_html call, the end of the URL contains the variable $cont, which is the crux of the question. It has to be dynamic, because it determines which page I am checking.
– Charles Fay
You can increment it.
– Ricardo
As you can see in my code, there is a $cont++ right after the foreach, but it does not seem to take effect: my check still pulls only the first 36 results.
– Charles Fay
That is because you are calling unset($html); inside the loop. Do that only after the whole process has finished.
– Ricardo
Do you know of any way to speed up parsing with simple_html_dom? The checks I’m running take about 55 seconds on average to return results, and sometimes I still get the fatal error; I need to find out why. I hosted it on a server for testing, take a look if you can: http://evolveti.com/paser/
– Charles Fay
I don’t know how to improve the performance, but the fatal error usually happens because file_get_html cannot fetch the page you want and returns false. But was the question resolved?
– Ricardo
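A minimal sketch of guarding against that false return before calling find(). The name hrefsFromPage and the $loader parameter are assumptions made for illustration; $loader stands in for file_get_html so the snippet runs without the library or the network.

```php
<?php
// Hypothetical helper: $loader stands in for file_get_html().
function hrefsFromPage(callable $loader, string $url): array
{
    $html = $loader($url);
    if (!$html) {
        // The download failed: calling ->find() on false would be a fatal error.
        return [];
    }
    $hrefs = [];
    foreach ($html->find('a') as $a) {
        $hrefs[] = (string) $a->href;
    }
    return $hrefs;
}

// A loader that always fails, to show the guard in action.
$alwaysFails = function ($url) {
    return false;
};
echo count(hrefsFromPage($alwaysFails, 'http://example.invalid/')), "\n"; // 0
```

In the real script the loader would simply be 'file_get_html', passed as a callable name.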
In a way, yes. I’m using your if($html) tip to try to fix the fatal error, but I couldn’t make while($html) work, so I left while($cont < 40): even with a high value, the host’s PHP server responds in a relatively acceptable time, 55 seconds on average. It is working; now I just need to tune the performance and handle the errors. Thank you!
– Charles Fay
Why didn’t while($html) work? Accept the answer if it was useful; otherwise give me more details =D
– Ricardo
My counter is not incremented inside while($html), so the script only takes the 36 items from the first page without going through the others. I removed the unset($html) and it still did not work...
– Charles Fay
Isn’t there a way to increment it inside the while?
– Ricardo
Yes, but it turns out it only increments once: it goes from 0 to 1 and exits the loop.
– Charles Fay
It exits the loop because you are calling unset($html); do that only after the process has finished.
– Ricardo
I put it inside the while($html) and commented out the unset (//unset($html)). Look what happens: http://evolveti.com/paser/toyota.php
– Charles Fay
Look, remove the unset, and below the $cont++ add a control statement that checks whether there are more pages to be parsed; if there are, set $html = true;
– Ricardo
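One way to sketch that control statement is to keep requesting pages until a page yields no link that has not been seen before. Everything named here is an assumption for illustration: collectAllLinks, the $fetchLinks callable (which would wrap file_get_html and the 'comprar/toyota' filter in the real script), the $maxPages safety cap, and whether the site's first page is 0 or 1. The fake fetcher below just simulates 133 items at 36 per page.

```php
<?php
/**
 * Collects links page by page until a page brings nothing new.
 * $fetchLinks is a hypothetical callable: given a page number it returns
 * the matching hrefs of that page (an empty array when there are none
 * or when the download failed).
 */
function collectAllLinks(callable $fetchLinks, int $firstPage = 1, int $maxPages = 500): array
{
    $all = [];
    for ($page = $firstPage; $page < $firstPage + $maxPages; $page++) {
        $links = $fetchLinks($page);
        // Keep only links not seen before, in case the site repeats a page.
        $new = array_diff($links, $all);
        if (count($new) === 0) {
            break; // control statement: no more pages to parse
        }
        $all = array_merge($all, array_values(array_unique($new)));
    }
    return $all;
}

// Stand-in for the real site: 133 fake items, 36 per page.
$items = [];
for ($i = 1; $i <= 133; $i++) {
    $items[] = "/comprar/toyota/item-$i";
}
$fakeFetch = function (int $page) use ($items): array {
    return array_slice($items, ($page - 1) * 36, 36);
};

$toyota = collectAllLinks($fakeFetch);
echo count($toyota), " links found\n"; // 133 links found
```

With this shape there is no static page count: the loop stops on its own after the fourth page, and $maxPages only exists to prevent an endless loop if the site keeps returning new content.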