capture information from websites


4

How do Buscapé and similar sites manage to get information from other websites? Is it done with cURL, or through an XML file that the store websites make available?

  • Although the answers are good and give an idea of how such a process is done, there is no way to really know how it works; only the developers at these companies could say. I believe it is via "feeds" in some cases, but there is no way to be sure.

3 answers

6


There are several techniques for extracting information from other sites; the name given to this is 'parsing' (many programmers here loosely say 'parsing the website'). If the websites provide an XML feed to Buscapé, the engineers' work drops considerably, since the XML already contains correctly formatted tags, which makes the PHP side faster; simplexml_load_file is very fast and easy to use.

But if the site does not offer such a file, the solution may be crawling to collect the links, or plain cURL. cURL only fetches the HTML from the remote server, using a request method such as POST or GET; to then extract the data, you can use DOMDocument, which is what I use most, together with DOMXPath, a companion DOM class for querying the HTML. There is also the Simple HTML DOM Parser.

Here’s an example I just wrote to show you, capturing data from G1:

    // Suppress libxml warnings caused by malformed HTML
    libxml_use_internal_errors(true);
    libxml_clear_errors();

    $header = "X-Forwarded-For: {$_SERVER['REMOTE_ADDR']}";
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, "http://g1.globo.com/bemestar/noticia/2011/03/medica-orienta-sobre-o-que-fazer-em-caso-de-dor-de-ouvido-e-como-evita-la.html");
    curl_setopt($ch, CURLOPT_REFERER, "http://g1.globo.com");
    curl_setopt($ch, CURLOPT_HTTPHEADER, array($header));
    curl_setopt($ch, CURLOPT_USERAGENT, $_SERVER['HTTP_USER_AGENT']);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the response as a string
    $html = curl_exec($ch);
    curl_close($ch);

    // Load the fetched HTML and query it with XPath
    $DOM = new DOMDocument();
    $DOM->loadHTML($html);
    $xpath = new DOMXPath($DOM);
    $titulo = $xpath->query('//input[@name="materia_titulo"]/@value')->item(0);
    $letra  = $xpath->query('//div[@id="materia-letra"]')->item(0);
    echo "Título da matéria: " . $titulo->nodeValue . "<p>" . "Conteúdo da matéria: " . $letra->nodeValue;
  • I found it very good, but it is quite a bit of work to take the values of the HTML tags and save them in a database, with different tags across different sites.

  • Yes, it is some work! But with a well-designed architecture, the service works very well!

  • Cool stuff. Does it work with any website if you use cURL, Cassiano José?

5

It depends on the site; it is not something generic.

You can get the information from:

  • sitemaps
  • information feeds (JSON, for example; see the sketch after this list)
  • APIs
  • crawling through the pages and links of the websites
  • other mechanisms...
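
As an illustration of the feeds option, here is a minimal sketch of consuming a hypothetical JSON feed with PHP's native functions; the URL and the name/price fields are assumptions made for the example, not a real endpoint:

    // Fetch a (hypothetical) JSON product feed and decode it
    $json = file_get_contents('https://example.com/products.json');
    $products = json_decode($json, true); // decode into an associative array

    foreach ($products as $product) {
        echo $product['name'] . ' - ' . $product['price'] . '<br />';
    }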

3

There are several alternatives for fetching content from a site:

  • Parsing the site: you literally download the HTML and inspect the DOM elements of the page. A PHP library for this purpose is the Simple HTML DOM Parser (see the first sketch after this list).
  • Parsing an XML feed provided by the site: you can use PHP's native functions for this, see the example below:

    // $feedLink holds the RSS feed URL (assumed to be set beforehand)
    $feed = simplexml_load_file($feedLink, 'SimpleXMLElement', LIBXML_NOCDATA);

    $limit = 10; // maximum number of items to print
    $count = 0;
    foreach ($feed->channel->item as $item) {
        if ($count == $limit) {
            break;
        }
        echo $item->link . '<br />';
        echo $item->title . '<br />';
        echo $item->description . '<br />';
        echo $item->pubDate . '<br />';
        echo '<br />------------------<br /><br />';
        $count++;
    }
    
  • Crawling: follows the links and is used in conjunction with a parser (which extracts the information from each page). A PHP library for this purpose is Phpcrawl; a hand-rolled sketch of the idea appears as the second example below.
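
For the first option, a minimal sketch using the Simple HTML DOM Parser might look like the following; the target URL is a placeholder, and simple_html_dom.php is the file distributed with the library:

    include 'simple_html_dom.php';

    // Download the page and build a DOM object from it
    $html = file_get_html('http://example.com/');

    // Print the text and target of every link on the page
    foreach ($html->find('a') as $link) {
        echo $link->plaintext . ' -> ' . $link->href . '<br />';
    }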

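For the crawling option, instead of Phpcrawl's own API, here is a minimal hand-rolled sketch of the crawl-then-parse idea using only the cURL and DOMDocument techniques from the first answer; the start URL and the 10-page limit are arbitrary choices:

    // Breadth-first crawl: fetch a page, collect its links, repeat
    libxml_use_internal_errors(true);
    $queue = array('http://example.com/'); // start URL (placeholder)
    $visited = array();

    while (!empty($queue) && count($visited) < 10) { // visit at most 10 pages
        $url = array_shift($queue);
        if (isset($visited[$url])) {
            continue; // skip pages we have already fetched
        }
        $visited[$url] = true;

        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        $html = curl_exec($ch);
        curl_close($ch);
        if ($html === false) {
            continue;
        }

        // Parse the page and enqueue every absolute link found on it
        $dom = new DOMDocument();
        $dom->loadHTML($html);
        $xpath = new DOMXPath($dom);
        foreach ($xpath->query('//a/@href') as $href) {
            if (strpos($href->nodeValue, 'http') === 0) {
                $queue[] = $href->nodeValue;
            }
        }
        // At this point a parser would extract whatever data you need from $html
    }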