How do Buscapé and other sites manage to get information from other sites? Is it through cURL, or through an XML file that the store websites make available?
There are several techniques for getting information from other sites; the general name for this is parsing (many programmers here loosely say "parsing the website"). If a site offers an XML feed to Buscapé, the work of the site's engineers drops considerably, because the XML already contains correctly formatted tags, which makes the PHP side much faster: simplexml_load_file is very fast and easy to use.
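As a minimal sketch of that case (the feed URL and the produto/preco structure below are hypothetical, just to illustrate the idea), reading such a feed could look like this:

// Hypothetical product feed; the URL and tag names are assumptions for illustration
$xml = simplexml_load_file('http://www.exemplo-loja.com.br/produtos.xml');

if ($xml !== false) {
    foreach ($xml->produto as $produto) {
        // Each field is already a well-formed node, so no HTML scraping is needed
        echo $produto->nome . ' - R$ ' . $produto->preco . '<br />';
    }
}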
But if the site does not offer such a file, the solution may be crawling to obtain the links, or plain cURL. cURL itself only serves to send the request (via POST or GET) and fetch the HTML from the remote server; to actually extract data from that HTML you can use DOMDocument, which is what I use most, together with DOMXPath, a DOM class that lets you run XPath queries over the parsed HTML. There is also the Simple HTML DOM Parser library.
Here’s an example I just did to show you, capturing G1 data:
// Suppress libxml warnings caused by malformed HTML
libxml_use_internal_errors(true);

$header = "X-Forwarded-For: {$_SERVER['REMOTE_ADDR']}";

$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "http://g1.globo.com/bemestar/noticia/2011/03/medica-orienta-sobre-o-que-fazer-em-caso-de-dor-de-ouvido-e-como-evita-la.html");
curl_setopt($ch, CURLOPT_REFERER, "http://g1.globo.com");
curl_setopt($ch, CURLOPT_HTTPHEADER, array($header));
curl_setopt($ch, CURLOPT_USERAGENT, $_SERVER['HTTP_USER_AGENT']);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the HTML instead of printing it
$html = curl_exec($ch);
curl_close($ch);

// Parse the fetched HTML and query it with XPath
$DOM = new DOMDocument();
$DOM->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($DOM);

$titulo = $xpath->query('//input[@name="materia_titulo"]/@value')->item(0);
$letra  = $xpath->query('//div[@id="materia-letra"]')->item(0);

echo "Título da matéria: " . $titulo->nodeValue . "<p>" . "Conteúdo da matéria: " . $letra->nodeValue;
I found it very good, but it is quite a job to take the values of the HTML tags and save them in a database, with different tags for each site.
Yes, it is quite a bit of work! But with a well-designed architecture the service works very well.
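To give an idea of that last step (the table name and columns here are hypothetical assumptions, not a real schema), persisting the values extracted in the G1 example with PDO could look like this:

// Hypothetical 'materias' table; connection credentials are placeholders
$pdo = new PDO('mysql:host=localhost;dbname=scraper', 'usuario', 'senha');

$stmt = $pdo->prepare('INSERT INTO materias (site, titulo, conteudo) VALUES (?, ?, ?)');

// $titulo and $letra come from the DOMXPath queries in the answer above
$stmt->execute(array('g1.globo.com', $titulo->nodeValue, $letra->nodeValue));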
Cool stuff! Does it work with any website if I use cURL, Cassiano José?
It depends on the site; it is not something generic. You can get the information from:
There are several alternatives for fetching content from a site:
Parsing the XML provided by the site: you can use PHP's native functions for this; see the example below:
// $feedLink should point at the site's RSS/XML feed
$feed = simplexml_load_file($feedLink, 'SimpleXMLElement', LIBXML_NOCDATA);

$count = 0;
$limit = 10; // maximum number of items to read

foreach ($feed->channel->item as $item) {
    if ($count == $limit) {
        break;
    }
    echo $item->link . '<br />';
    echo $item->title . '<br />';
    echo $item->description . '<br />';
    echo $item->pubDate . '<br />';
    echo '<br />------------------<br /><br />';
    $count++;
}
Crawling: follows the links, and is used in conjunction with a parser (which extracts the information from the pages). A PHP library built for this purpose is Phpcrawl; a minimal hand-rolled sketch of the idea follows below.
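Just to make the concept concrete (the starting URL is a placeholder, and Phpcrawl does all of this, and much more, for you), a tiny breadth-first crawler with cURL and DOMXPath could look like this:

// Minimal breadth-first crawler: fetch a page, collect its links, visit them.
libxml_use_internal_errors(true); // ignore warnings from malformed HTML

$queue    = array('http://www.example.com/'); // placeholder start URL
$visited  = array();
$maxPages = 10;

while (!empty($queue) && count($visited) < $maxPages) {
    $url = array_shift($queue);
    if (isset($visited[$url])) {
        continue; // already crawled
    }
    $visited[$url] = true;

    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    $html = curl_exec($ch);
    curl_close($ch);

    if ($html === false) {
        continue; // request failed, skip this page
    }

    // This is where a parser would extract the data you care about;
    // here we only collect the links to keep crawling.
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    $xpath = new DOMXPath($dom);

    foreach ($xpath->query('//a/@href') as $href) {
        // Only follow absolute http(s) links in this simplified sketch
        if (preg_match('#^https?://#', $href->nodeValue)) {
            $queue[] = $href->nodeValue;
        }
    }
}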
Although the answers are good and give an idea of how such a process is done, there is no way to really know how it works; only the developers at these companies could say. I believe it is done via "feeds" in some cases, but there is no way to be sure.
– Guilherme Nascimento