You can use cURL to fetch the page and then DOMDocument (or a regex) to extract the page data.
Facebook uses Open Graph markup, and since many websites support it, you can read that data as well.
As an example I'm using
http://g1.globo.com/rj/sul-do-rio-costa-verde/noticia/2017/01/acidente-com-teori-zavascki-aviao-comeca-ser-retirado-do-mar.html
, which is the latest news story on Globo.com right now.
From this page you can extract the og:image, og:title and og:description meta tags. In addition, every website has, or is expected to have, the standard meta description and the title.
For example, building on an answer to another question:
// Fetch the page HTML
$ch = curl_init('http://g1.globo.com/rj/sul-do-rio-costa-verde/noticia/2017/01/acidente-com-teori-zavascki-aviao-comeca-ser-retirado-do-mar.html');
curl_setopt_array($ch, [
    CURLOPT_FOLLOWLOCATION => true,
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_SSL_VERIFYHOST => 2,
    CURLOPT_SSL_VERIFYPEER => true,
    CURLOPT_PROTOCOLS => CURLPROTO_HTTP | CURLPROTO_HTTPS,
    CURLOPT_REDIR_PROTOCOLS => CURLPROTO_HTTP | CURLPROTO_HTTPS,
    CURLOPT_TIMEOUT => 5,
    CURLOPT_MAXREDIRS => 2
]);
$html = curl_exec($ch);
curl_close($ch);

// Start the DOM and XPath (parser warnings from invalid real-world HTML are suppressed)
libxml_use_internal_errors(true);
$DOM = new DOMDocument;
$DOM->loadHTML($html);
$XPath = new DOMXPath($DOM);

// Properties to look for
$propriedades = ['description', 'title', 'type', 'image'];
$conteudo = [];

// Check each item of the array:
foreach ($propriedades as $propriedade) {
    $Meta = $XPath->query('//head//meta[(@property="og:'.$propriedade.'") or (@name="'.$propriedade.'")] | //head//'.$propriedade);
    // If the element is found, take its value
    if ($Meta->length !== 0) {
        $conteudo[$propriedade] = $Meta->item(0)->getAttribute('content') !== '' ? $Meta->item(0)->getAttribute('content') : $Meta->item(0)->nodeValue;
    }
}

var_dump($conteudo);
Result:
array(4) {
["description"]=>
string(134) "Serviço de remoção aconteceu no início da noite deste domingo (22).
Retirada foi feita por empresa contratada pelo Grupo Emiliano."
["title"]=>
string(73) "Acidente com Teori Zavascki: Avião que caiu em Paraty é retirado do mar"
["type"]=>
string(7) "article"
["image"]=>
string(122) "http://s2.glbimg.com/IAaOKflQpOoOSoi7pGNjkmirtjI=/1200x630/filters:max_age(3600)/s02.video.glbimg.com/deo/vi/65/44/5594465"
}
With this information you can build the HTML however you wish.
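As a rough sketch of that last step, the extracted values could be turned into a small preview card like this. The function name renderPreview, the markup and the CSS class are assumptions made for illustration, not part of the original answer; the one point that matters is escaping every value, since the metadata comes from a third-party page.

```php
<?php
// Hypothetical helper: builds a simple preview card from the extracted metadata.
// The name, the markup and the "preview" class are illustrative assumptions.
function renderPreview(array $conteudo, string $url): string
{
    // Escape everything: the values were read from someone else's page.
    $e = function (string $s): string {
        return htmlspecialchars($s, ENT_QUOTES, 'UTF-8');
    };

    $html = '<a class="preview" href="' . $e($url) . '">';
    if (!empty($conteudo['image'])) {
        $html .= '<img src="' . $e($conteudo['image']) . '" alt="">';
    }
    // Fall back to the URL when the page has no title.
    $html .= '<strong>' . $e($conteudo['title'] ?? $url) . '</strong>';
    if (!empty($conteudo['description'])) {
        $html .= '<p>' . $e($conteudo['description']) . '</p>';
    }
    return $html . '</a>';
}

echo renderPreview([
    'title' => 'Example title',
    'description' => 'A short description.',
    'image' => 'http://example.com/image.jpg',
], 'http://example.com/');
```

Missing keys are simply skipped, so the same function works for pages that only expose a title.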
Explanations:
CURL:
CURLOPT_FOLLOWLOCATION makes cURL follow the Location: header if the server sends one, and CURLOPT_RETURNTRANSFER is needed to get the response back as a string. CURLOPT_SSL_VERIFYHOST and CURLOPT_SSL_VERIFYPEER are left enabled in the code above, so the certificate is verified; you would only turn them off to read from a server that has, for example, a self-signed certificate, and doing so weakens security. CURLOPT_TIMEOUT and CURLOPT_MAXREDIRS add a timeout and a maximum number of redirects.
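The request can fail (timeout, DNS error, refused connection), in which case curl_exec returns false. A minimal sketch of a wrapper that accounts for this is shown below; the function name fetchHtml and the convention of returning null on failure are assumptions for illustration.

```php
<?php
// Hypothetical wrapper around the cURL options discussed above.
// The name fetchHtml and the null-on-failure convention are assumptions.
function fetchHtml(string $url): ?string
{
    $ch = curl_init($url);
    if ($ch === false) {
        return null;
    }
    curl_setopt_array($ch, [
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_SSL_VERIFYHOST => 2,
        CURLOPT_SSL_VERIFYPEER => true,
        CURLOPT_PROTOCOLS => CURLPROTO_HTTP | CURLPROTO_HTTPS,
        CURLOPT_REDIR_PROTOCOLS => CURLPROTO_HTTP | CURLPROTO_HTTPS,
        CURLOPT_TIMEOUT => 5,
        CURLOPT_MAXREDIRS => 2,
    ]);

    $html = curl_exec($ch);
    // curl_exec returns false on transport errors (timeout, DNS failure, ...).
    if ($html === false) {
        return null;
    }
    // Treat HTTP error statuses (4xx/5xx) as failures too.
    if (curl_getinfo($ch, CURLINFO_HTTP_CODE) >= 400) {
        return null;
    }
    // The handle is released when $ch goes out of scope.
    return $html;
}

$html = fetchHtml('http://example.com/');
if ($html === null) {
    echo "Could not fetch the page\n";
}
```

Checking for null before calling loadHTML avoids passing false or an error page to the parser.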
XPATH:
XPath is used to query for the data:
//head//meta[(@property="og:'.$propriedade.'") or (@name="'.$propriedade.'")] | //head//'.$propriedade
This matches all of the cases below:
<head>
<description>Valor</description>
<meta name="description" content="Valor" />
<meta property="og:description" content="Valor" />
</head>
To check whether anything matched, that is, whether there really is any data, this is used:
$Meta->length !== 0
Since the value can be in the content attribute (the last two examples) or inside the tag itself (the first example), this is used:
$conteudo[$propriedade] = $Meta->item(0)->getAttribute('content') !== '' ? $Meta->item(0)->getAttribute('content') : $Meta->item(0)->nodeValue;
This checks whether the content attribute holds any data (getAttribute returns an empty string when the attribute is missing); if it does not, the value of the element itself is used instead.
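To try the lookup without making a request, the same query can be run against an inline HTML string. This is only a sketch: the function name extractMeta and the sample HTML are assumptions, and the property names are expected to come from a trusted list because they are concatenated into the XPath expression.

```php
<?php
// Hypothetical helper wrapping the lookup explained above; the name is an assumption.
function extractMeta(string $html, array $propriedades): array
{
    // Real-world HTML is rarely valid, so keep parser warnings internal.
    libxml_use_internal_errors(true);
    $DOM = new DOMDocument;
    $DOM->loadHTML($html);
    libxml_clear_errors();
    $XPath = new DOMXPath($DOM);

    $conteudo = [];
    foreach ($propriedades as $propriedade) {
        $Meta = $XPath->query('//head//meta[(@property="og:'.$propriedade.'") or (@name="'.$propriedade.'")] | //head//'.$propriedade);
        if ($Meta !== false && $Meta->length !== 0) {
            $node = $Meta->item(0);
            // Prefer the content attribute, fall back to the element's text.
            $content = $node->getAttribute('content');
            $conteudo[$propriedade] = $content !== '' ? $content : $node->nodeValue;
        }
    }
    return $conteudo;
}

// Sample document (made up for this example).
$html = '<html><head><title>Page title</title>'
      . '<meta name="description" content="Plain description">'
      . '<meta property="og:image" content="http://example.com/a.jpg">'
      . '</head><body></body></html>';

var_dump(extractMeta($html, ['title', 'description', 'image', 'type']));
```

Properties that are not found, such as type in this sample, are simply absent from the returned array.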
Your question seems rather broad. But just to be sure: do you want to build a system that "reads" the content of another website, or do you want your own website to be shown this way when it is shared on Facebook?
– Inkeliz
@Inkeliz a system that reads the content of other websites!
– PululuK