How to make a "link format" (a system that reads the content of other websites)?

Question

Asked 8 years, 11 months ago

Viewed 92 times

2

I would like to integrate a system similar to Facebook for reading external links in my project.

For example, when a link such as "www.un-site-any.com" is posted on my website, I would like to get a result like the one in the figure below.

Your question seems rather broad. Just to be sure: do you want to build a system that "reads" the content of other websites, or do you want your own website to be shown this way when it is shared on Facebook?

– Inkeliz

2017/01/23 at 08:05
@Inkeliz a system that reads the content of other websites!

– PululuK

2017/01/23 at 08:15

1 answer

by Inkeliz • **20,671** points · Answer 1 · 2017-01-23T08:48:13+00:00

You can use cURL to fetch the page and then DOMDocument (or a regular expression) to extract the page data.

Facebook uses Open Graph markup. Since many websites support it, you can read that data as well.

As an example I am using http://g1.globo.com/rj/sul-do-rio-costa-verde/noticia/2017/01/acidente-com-teori-zavascki-aviao-comeca-ser-retirado-do-mar.html, which is the latest news story on Globo.com right now.

From this page you can extract the og:image, og:title and og:description meta tags. In addition, most websites include (or are expected to include) the standard description meta tag and a title.

For example, building on an answer to another question:

// Fetch the HTML of the page
$ch = curl_init('http://g1.globo.com/rj/sul-do-rio-costa-verde/noticia/2017/01/acidente-com-teori-zavascki-aviao-comeca-ser-retirado-do-mar.html');
curl_setopt_array($ch, [
    CURLOPT_FOLLOWLOCATION => true,
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_SSL_VERIFYHOST => 2,
    CURLOPT_SSL_VERIFYPEER => true,
    CURLOPT_PROTOCOLS => CURLPROTO_HTTP | CURLPROTO_HTTPS,
    CURLOPT_REDIR_PROTOCOLS => CURLPROTO_HTTP | CURLPROTO_HTTPS,
    CURLOPT_TIMEOUT => 5,
    CURLOPT_MAXREDIRS => 2
]);
$html = curl_exec($ch);
curl_close($ch);

// curl_exec returns false on failure; there is nothing to parse in that case
if ($html === false || $html === '') {
    exit('Could not fetch the page.');
}

// Start the DOM and XPath.
// Real pages are rarely valid HTML, so parser warnings are collected instead of printed.
libxml_use_internal_errors(true);
$DOM = new DOMDocument;
$DOM->loadHTML($html);
libxml_clear_errors();
$XPath = new DOMXPath($DOM);

// Properties to look for
$propriedades = ['description', 'title', 'type', 'image'];
$conteudo = [];

// Check each item of the array:
foreach ($propriedades as $propriedade) {

    $Meta = $XPath->query('//head//meta[(@property="og:'.$propriedade.'") or (@name="'.$propriedade.'")] | //head//'.$propriedade);

    // If an element was found, take its value
    if ($Meta->length !== 0) {
        $conteudo[$propriedade] = $Meta->item(0)->getAttribute('content') !== '' ? $Meta->item(0)->getAttribute('content') : $Meta->item(0)->nodeValue;
    }
}

var_dump($conteudo);

Result:

array(4) {
  ["description"]=>
  string(134) "Serviço de remoção aconteceu no início da noite deste domingo (22).
Retirada foi feita por empresa contratada pelo Grupo Emiliano."
  ["title"]=>
  string(73) "Acidente com Teori Zavascki: Avião que caiu em Paraty é retirado do mar"
  ["type"]=>
  string(7) "article"
  ["image"]=>
  string(122) "http://s2.glbimg.com/IAaOKflQpOoOSoi7pGNjkmirtjI=/1200x630/filters:max_age(3600)/s02.video.glbimg.com/deo/vi/65/44/5594465"
}

With this information you can build the HTML however you wish.
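As a rough illustration of that last step, here is a minimal sketch that turns the extracted array into a small preview block. The function name `renderPreview`, the markup and the `link-preview` class are my own assumptions for the example, not anything Facebook or the target page prescribes. Every value is escaped with `htmlspecialchars`, because it comes from a page you do not control.

```php
<?php
// Hypothetical helper: builds a preview card from the array produced above.
// The markup and the class name are illustrative assumptions.
// Assumes $url has already been validated as an http(s) URL.
function renderPreview(array $conteudo, string $url): string
{
    // Escape every value: it comes from a third-party page.
    $e = static function (string $s): string {
        return htmlspecialchars($s, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');
    };

    $html = '<a class="link-preview" href="' . $e($url) . '">';
    if (!empty($conteudo['image'])) {
        $html .= '<img src="' . $e($conteudo['image']) . '" alt="">';
    }
    // Fall back to the URL itself when the page exposed no title.
    $html .= '<strong>' . $e($conteudo['title'] ?? $url) . '</strong>';
    if (!empty($conteudo['description'])) {
        $html .= '<p>' . $e($conteudo['description']) . '</p>';
    }
    return $html . '</a>';
}
```

Usage would be something like `echo renderPreview($conteudo, $url);`, where `$url` is the link the user posted.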

Explanations:

CURL:

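CURLOPT_FOLLOWLOCATION makes cURL follow the Location: header when the server sends one, and CURLOPT_RETURNTRANSFER is required so that curl_exec returns the response instead of printing it. CURLOPT_SSL_VERIFYHOST and CURLOPT_SSL_VERIFYPEER are kept enabled in the code above, so the certificate is verified; a server with a self-signed certificate, for example, can only be read if you turn them off, which also removes the protection they give. CURLOPT_PROTOCOLS and CURLOPT_REDIR_PROTOCOLS restrict the request and any redirects to HTTP and HTTPS, while CURLOPT_TIMEOUT and CURLOPT_MAXREDIRS set a timeout and a maximum number of redirects.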

XPATH:

It is used to look up the information, with this query:

//head//meta[(@property="og:'.$propriedade.'") or (@name="'.$propriedade.'")] | //head//'.$propriedade

This matches all of the cases below:

<head>
<description>Valor</description>
<meta name="description" content="Valor" />
<meta property="og:description" content="Valor" />
</head>
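To try the query without making a network request, a small sketch like the following can run it against inline HTML strings. The helper name `countMatches` and the sample documents are made up for the illustration. The element case uses `<title>`, which is the standard element that the `//head//...` part of the query is most likely to match.

```php
<?php
// Hypothetical helper: counts how many nodes the answer's query matches.
function countMatches(string $html, string $propriedade): int
{
    libxml_use_internal_errors(true); // collect parser warnings quietly
    $DOM = new DOMDocument;
    $DOM->loadHTML($html);
    libxml_clear_errors();
    $XPath = new DOMXPath($DOM);

    $query = '//head//meta[(@property="og:' . $propriedade . '") or (@name="' . $propriedade . '")]'
           . ' | //head//' . $propriedade;

    return $XPath->query($query)->length;
}

// Sample documents, one per way of exposing the value (illustrative only)
$viaName     = '<html><head><meta name="description" content="Valor"></head><body></body></html>';
$viaProperty = '<html><head><meta property="og:description" content="Valor"></head><body></body></html>';
$viaElement  = '<html><head><title>Valor</title></head><body></body></html>';

var_dump(countMatches($viaName, 'description'));
var_dump(countMatches($viaProperty, 'description'));
var_dump(countMatches($viaElement, 'title'));
var_dump(countMatches($viaElement, 'image')); // nothing in this document exposes an image
```

The first three calls should each find one node, and the last should find none, which is the case the length check in the loop guards against.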

To check whether the query matched anything at all, this is used:

$Meta->length !== 0

Since the value can be in the content attribute (the last two examples) or inside the element itself (the first example), this was used:

$conteudo[$propriedade] = $Meta->item(0)->getAttribute('content') !== '' ? $Meta->item(0)->getAttribute('content') : $Meta->item(0)->nodeValue;

This checks whether the content attribute holds any data (getAttribute returns an empty string when the attribute is missing or empty). If it does not, the text value of the element is used instead.
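The same fallback can be pulled out into a small function, which avoids calling `getAttribute` twice. This is only a sketch: the name `valueOf` and the sample HTML are assumptions for the example.

```php
<?php
// Hypothetical helper mirroring the ternary above:
// prefer the content attribute, otherwise use the element's own text.
function valueOf(DOMElement $node): string
{
    $content = $node->getAttribute('content'); // '' when missing or empty
    return $content !== '' ? $content : $node->nodeValue;
}

libxml_use_internal_errors(true); // collect parser warnings quietly
$DOM = new DOMDocument;
$DOM->loadHTML('<html><head><title>Page title</title>'
    . '<meta name="description" content="Short summary"></head><body></body></html>');
libxml_clear_errors();
$XPath = new DOMXPath($DOM);

// <title> has no content attribute, so its text is used
$title = valueOf($XPath->query('//head//title')->item(0));

// <meta> carries the value in its content attribute
$description = valueOf($XPath->query('//head//meta[@name="description"]')->item(0));

var_dump($title, $description);
```

Inside the loop from the answer, the assignment would then read `$conteudo[$propriedade] = valueOf($Meta->item(0));`.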