Wouldn’t it be better to use DOMDocument?
In my humble opinion, when a tool already exists to solve a problem, that tool should be chosen. I think using regular expressions for a case like yours is going to take a lot of work.
So I recommend using DOMDocument, which is intended to represent an entire HTML or XML document.
See an example of how it could be done:
$content = file_get_contents("https://en.wikipedia.org/wiki/Nature_conservation");
$doc = new DOMDocument();
@$doc->loadHTML($content);
$titleTag = $doc->getElementsByTagName('title')->item(0);
// Get the page title
$title = $titleTag ? $titleTag->nodeValue : null;
// Get the value of div#content, but as text only
$body = $doc->getElementById('content')->nodeValue;
Note that the nodeValue property returns only the text, thus removing all tags present within #content.
If you need the text together with its tags, use the saveXml method instead:
$bodyWithTags = $doc->saveXml($doc->getElementById('content'));
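To make the difference between the two concrete, here is a minimal sketch that runs both on a small inline HTML string rather than on a downloaded page. The markup is made up for illustration, and the sketch uses libxml_use_internal_errors() in place of the @ operator, which is a swap on my part and not something the snippets above require.

```php
<?php
// Hypothetical markup, used only for illustration.
$html = '<html><head><title>Demo</title></head>'
      . '<body><div id="content"><p>Hello <b>world</b></p></div></body></html>';

$doc = new DOMDocument();
// Collect parser warnings internally instead of silencing them with @.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$node = $doc->getElementById('content');

// Text only: every tag inside #content is dropped.
$textOnly = $node->nodeValue;

// Markup preserved: the node is serialized together with its children.
$withTags = $doc->saveXML($node);

echo $textOnly, PHP_EOL;
echo $withTags, PHP_EOL;
```

The first echo prints the bare text, while the second prints the div with its p and b tags still in place.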
Update
If you want a reusable way to get only the page title, you can create a function:
/**
* Gets the title from the <title> tag of a URL
*
* @param string $url
* @return string|null
*/
function url_get_title($url) {
$content = file_get_contents($url);
$doc = new DOMDocument();
@$doc->loadHTML($content);
$titleTag = $doc->getElementsByTagName('title')->item(0);
if ($titleTag) {
return $titleTag->nodeValue;
}
return null;
}
Then, whenever you want to get the title of a page, you just call it like this:
url_get_title('http://www.google.com'); // string (Google)
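If the download can fail, a slightly more defensive variant may be worth considering. The sketch below is only one possible arrangement: html_get_title and url_get_title_safe are hypothetical names, the parsing is split out so it can be tried on a plain string, and a failed file_get_contents call yields null instead of handing false to the parser.

```php
<?php
/**
 * Hypothetical helper: extracts the <title> text from an HTML string.
 * Returns null if the markup is empty or has no <title>.
 */
function html_get_title(string $html): ?string
{
    if ($html === '') {
        return null; // loadHTML() does not accept an empty string
    }

    $doc = new DOMDocument();
    // Collect parser warnings internally, then restore the previous setting.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $titleTag = $doc->getElementsByTagName('title')->item(0);
    return $titleTag ? trim($titleTag->nodeValue) : null;
}

/**
 * Hypothetical wrapper: same idea as url_get_title(), but a failed
 * download returns null instead of passing false on to the parser.
 */
function url_get_title_safe(string $url): ?string
{
    $content = @file_get_contents($url);
    return $content === false ? null : html_get_title($content);
}

var_dump(html_get_title('<html><head><title> Example </title></head><body></body></html>'));
var_dump(html_get_title('<p>No title here</p>'));
```

Keeping the parsing separate from the download also makes the title extraction easy to check without any network access.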
NOTE: Whenever you use file_get_contents to fetch the content of a URL, remember that you must always include the URL scheme (http or https). If you don’t, PHP will try to open the string as a path to a local file. Even when the request is for the domain root itself, the scheme is required.
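One way to guard against a missing scheme is to check the URL before fetching it. This is a rough sketch: has_http_scheme is a hypothetical helper name, and it only inspects the scheme returned by parse_url, so it does not validate the rest of the URL.

```php
<?php
/**
 * Hypothetical helper: true only if $url carries an http or https scheme.
 */
function has_http_scheme(string $url): bool
{
    // parse_url() returns null when the component is absent
    // and false when the URL is seriously malformed.
    $scheme = parse_url($url, PHP_URL_SCHEME);

    return is_string($scheme)
        && in_array(strtolower($scheme), ['http', 'https'], true);
}

var_dump(has_http_scheme('http://www.google.com')); // bool(true)
var_dump(has_http_scheme('www.google.com'));        // bool(false)
```

Calling this before file_get_contents lets you reject or fix the URL early instead of having PHP look for a local file named www.google.com.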
Do it with explode: explode on the marker where the part you want begins and take array[1], then explode that on the marker where you want it to end and take array[0].
– Jasar Orion
The problem is that this expression is not picking up any content. I’m starting to think that preg_match can’t handle very large content, or else I don’t know what’s going on.
– Aprendiz
The question still needs the excerpt of the source that should be captured, and an explanation of the rules for capturing it. As it stands, it can’t be answered reliably.
– Bacco
I don’t know what the capture rules are. As I said, the section I want goes from <div id="content" class="mw-body" role="main"> to <span class="mw-headline" id="Ver_tamb.C3.A9m">See also</span>. The source is $content = file_get_contents("https://pt.wikipedia.org/wiki/Conserva%C3%A7%C3%A3o_da_nature");
– Aprendiz