About Regular Expression in PHP - How to Get Part of a Text?

Question


Asked 8 years, 9 months ago

Viewed 902 times


I’m having trouble extracting part of the text from a Wikipedia page. I can get the title this way:

$content = file_get_contents("https://en.wikipedia.org/wiki/Nature_conservation");

preg_match("/<title>(.*?)<\/title>/", $content, $title);

What I can’t manage is to capture the content that runs from <div id="content" class="mw-body" role="main"> up to <span class="mw-headline" id="Ver_tamb.C3.A9m">Ver também</span>.

I don’t understand why it doesn’t work.

Do it with explode: explode on the first marker and take array[1], then explode that piece on the marker where you want to stop and take array[0].

– Jasar Orion

2016/09/25 at 23:46
The problem is that this expression is not capturing any content. I’m starting to think that preg_match can’t handle very large content; otherwise I don’t know what’s going on.

– Aprendiz

2016/09/27 at 15:34
The question still needs the excerpt of the source to be captured and an explanation of the rules for capturing it. As it stands, it can’t be answered safely.

– Bacco

2016/09/27 at 16:19
I don’t know the rules for capturing it. As I said, the section I want runs from <div id="content" class="mw-body" role="main"> up to <span class="mw-headline" id="Ver_tamb.C3.A9m">Ver também</span>. The source is $content = file_get_contents("https://pt.wikipedia.org/wiki/Conserva%C3%A7%C3%A3o_da_nature");

– Aprendiz

2016/09/28 at 00:38
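
The explode suggestion from the comments above can be sketched roughly as follows. This is only an illustration: the helper name text_between and the stand-in HTML string are assumptions, not something posted in the thread, and the markers must match the page source byte for byte for it to work.

```php
<?php
// Hypothetical helper (not from the thread): returns the text between two
// literal markers, or null when either marker is missing.
function text_between(string $haystack, string $start, string $end): ?string
{
    // Split once on the start marker; index 1 is everything after it.
    $afterStart = explode($start, $haystack, 2);
    if (count($afterStart) < 2) {
        return null;
    }

    // Split the remainder once on the end marker; index 0 is everything before it.
    $beforeEnd = explode($end, $afterStart[1], 2);
    if (count($beforeEnd) < 2) {
        return null;
    }

    return $beforeEnd[0];
}

// Stand-in HTML (an assumption) so the sketch runs without a network request.
$html = '<div id="content" class="mw-body" role="main"><p>Body text</p>'
      . '<span class="mw-headline" id="Ver_tamb.C3.A9m">Ver também</span></div>';

$part = text_between(
    $html,
    '<div id="content" class="mw-body" role="main">',
    '<span class="mw-headline" id="Ver_tamb.C3.A9m">'
);
```

If the live page orders or spells the attributes differently, the markers will not match and the helper returns null, which is one reason a literal-string approach is fragile on real HTML.
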

1 answer

by Wallace Maxters • **102,340** points · Answer 1 · 2016-09-26T14:59:44+00:00

Wouldn’t it be better to use `DomDocument`?

In my humble opinion, when a tool already exists to solve a problem, that tool should be chosen. I think using regular expressions in a case like yours is going to take a lot of work.

So I recommend using DOMDocument, which is designed to represent an HTML or XML document.

See an example of how it could be done:

$content = file_get_contents("https://en.wikipedia.org/wiki/Nature_conservation");

$doc = new DOMDocument();

// The @ suppresses the warnings that loadHTML emits for imperfect real-world HTML
@$doc->loadHTML($content);

// Get the page title
$titleTag = $doc->getElementsByTagName('title')->item(0);

$title = $titleTag ? $titleTag->nodeValue : null;

// Get the value of div#content, but text only
$body = $doc->getElementById('content')->nodeValue;

Note that the nodeValue property returns only the text, so all tags inside #content are removed.

If you need the text together with its tags, use the saveXml method instead:

 $bodyWithTags = $doc->saveXml($doc->getElementById('content'));
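
To make the difference between the two concrete, here is a small sketch that runs on an inline HTML string instead of the real Wikipedia page. The markup and the variable names $textOnly and $withTags are assumptions chosen for illustration, and libxml_use_internal_errors is used here in place of the @ operator only as one alternative way to silence parser warnings.

```php
<?php
// Stand-in markup (an assumption, not the real Wikipedia page).
$html = '<html><head><title>Demo</title></head><body>'
      . '<div id="content"><p>Hello <b>world</b></p></div>'
      . '</body></html>';

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$node = $doc->getElementById('content');

// Text only: the <p> and <b> tags are dropped.
$textOnly = $node->nodeValue;

// Serialized markup of the node, tags included.
$withTags = $doc->saveXML($node);
```

Here $textOnly holds just the words, while $withTags keeps the div together with its nested p and b elements.
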

Update

If you want a reusable way to get only the page title, you can create a function:

/**
 * Gets the title from the <title> tag of a URL.
 *
 * @param string $url
 * @return string|null
 */
function url_get_title($url) {

    $content = file_get_contents($url);

    // file_get_contents returns false on failure; an empty body has no title either
    if ($content === false || $content === '') {
        return null;
    }

    $doc = new DOMDocument();

    @$doc->loadHTML($content);

    $titleTag = $doc->getElementsByTagName('title')->item(0);

    if ($titleTag) {
        return $titleTag->nodeValue;
    }

    return null;
}

Then, whenever you want the title of a page, just call:

url_get_title('http://www.google.com'); // string (Google)

NOTE: Whenever you use file_get_contents to fetch the content of a URL, remember that you must always include the URL scheme (http or https). If you don’t, PHP will try to open the string as a path to a local file. The scheme is required even when the request goes to your own domain.
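
One way to guard against a missing scheme is to check the URL before fetching it. The sketch below is only an illustration; the helper name has_http_scheme is an assumption and not part of PHP.

```php
<?php
// Hypothetical helper: true only when the URL explicitly uses http or https.
function has_http_scheme(string $url): bool
{
    // parse_url returns null when the component is absent, false on badly malformed input.
    $scheme = parse_url($url, PHP_URL_SCHEME);

    return is_string($scheme)
        && in_array(strtolower($scheme), ['http', 'https'], true);
}

$withScheme    = has_http_scheme('http://www.google.com');
$withoutScheme = has_http_scheme('www.google.com');
```

A caller could return early when the check fails, rather than letting file_get_contents treat the string as a local file path.
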
