Regular expressions in PHP: how do I get part of a text?

I’m having trouble extracting part of the text from a Wikipedia page. I can get the title this way:

$content = file_get_contents("https://en.wikipedia.org/wiki/Nature_conservation");

preg_match("/<title>(.*?)<\/title>/", $content, $title);

What I can’t manage is to extract the content that runs from <div id="content" class="mw-body" role="main"> up to <span class="mw-headline" id="Ver_tamb.C3.A9m">Ver também</span>

I don’t understand why it doesn’t work.

  • Do it with explode: explode on the first marker and take array[1], then explode that on the marker where you want to stop and take array[0].

  • The problem is that this expression is not capturing any content. I’m starting to think that preg_match can’t handle very large content; otherwise I don’t know what’s going on.

  • The question still needs the excerpt of the source you want to capture and the rules for capturing it. As it stands, it can’t be answered reliably.

  • I don’t know the rules for capturing it. As I said, the section I want runs from <div id="content" class="mw-body" role="main"> up to <span class="mw-headline" id="Ver_tamb.C3.A9m">Ver também</span>. The source is $content = file_get_contents("https://pt.wikipedia.org/wiki/Conserva%C3%A7%C3%A3o_da_nature");
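The explode suggestion from the first comment could look roughly like this. It is a minimal sketch that assumes both markers appear in the downloaded HTML exactly as written; the helper name text_between and the sample HTML are hypothetical and used only for illustration.

```php
<?php
/**
 * Hypothetical helper: returns the text found between $start and $end,
 * or null when either marker is missing.
 */
function text_between(string $haystack, string $start, string $end): ?string
{
    // Split once at the start marker; index 1 holds everything after it.
    $afterStart = explode($start, $haystack, 2);
    if (count($afterStart) < 2) {
        return null;
    }

    // Split the remainder once at the end marker; index 0 holds everything before it.
    $beforeEnd = explode($end, $afterStart[1], 2);
    if (count($beforeEnd) < 2) {
        return null;
    }

    return $beforeEnd[0];
}

// Sample HTML standing in for the downloaded page (an assumption for illustration).
$content = '<div id="content" class="mw-body" role="main"><p>Texto</p>'
    . '<span class="mw-headline" id="Ver_tamb.C3.A9m">Ver também</span></div>';

$part = text_between(
    $content,
    '<div id="content" class="mw-body" role="main">',
    '<span class="mw-headline" id="Ver_tamb.C3.A9m">'
);

echo $part, PHP_EOL; // <p>Texto</p>
```

This only works while the markers match the page byte for byte, so a change in attribute order or spacing on the real page would make it return null.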

1 answer

Wouldn’t it be better to use DOMDocument?

In my humble opinion, when a tool already exists to solve a problem, it should be preferred. I think using regular expressions for a case like yours is going to take a lot of work.

So I recommend using DOMDocument, which represents an HTML or XML document.

See an example of how it could be done:

$content = file_get_contents("https://en.wikipedia.org/wiki/Nature_conservation");

$doc = new DOMDocument();

// The @ suppresses the warnings libxml emits for HTML it considers malformed
@$doc->loadHTML($content);

$titleTag = $doc->getElementsByTagName('title')->item(0);

// Get the page title
$title = $titleTag ? $titleTag->nodeValue : null;

// Get the value of div#content, but text only
$body = $doc->getElementById('content')->nodeValue;

Note that the nodeValue property returns only the text, stripping all tags inside #content.

If you need the text together with its tags, use the saveXML method:

 $bodyWithTags = $doc->saveXML($doc->getElementById('content'));
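To stop at the "Ver também" heading, as the question asks, one option is to delete that heading and everything after it from the DOM before serializing #content. The sketch below is only an illustration: the helper name html_before_marker is hypothetical, the sample HTML and its nesting are assumptions, and it uses libxml_use_internal_errors instead of the @ operator to silence parser warnings.

```php
<?php
// Sample HTML standing in for the downloaded page; the nesting is an assumption.
$content = '<html><head><title>Example</title></head><body>'
    . '<div id="content" class="mw-body" role="main"><div id="bodyContent">'
    . '<p>First paragraph.</p>'
    . '<h2><span class="mw-headline" id="Ver_tamb.C3.A9m">Ver tamb&eacute;m</span></h2>'
    . '<ul><li>Another article</li></ul>'
    . '</div></div></body></html>';

/**
 * Hypothetical helper: returns the HTML of the element #$containerId with the
 * element #$markerId (or its enclosing heading) and everything after it removed.
 * Returns null when the container does not exist.
 */
function html_before_marker(string $html, string $containerId, string $markerId): ?string
{
    $previous = libxml_use_internal_errors(true); // collect parser warnings silently
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $container = $doc->getElementById($containerId);
    if ($container === null) {
        return null;
    }

    $stop = $doc->getElementById($markerId);
    if ($stop !== null && preg_match('/^h[1-6]$/i', $stop->parentNode->nodeName)) {
        $stop = $stop->parentNode; // cut at the enclosing heading, if there is one
    }

    // Trim only when the stop node is a real descendant of the container.
    $inside = false;
    for ($n = $stop !== null ? $stop->parentNode : null; $n !== null; $n = $n->parentNode) {
        if ($n->isSameNode($container)) {
            $inside = true;
            break;
        }
    }

    if ($inside) {
        // Walk up from the stop node to the container, dropping later siblings at each level.
        for ($n = $stop; !$n->isSameNode($container); $n = $n->parentNode) {
            while ($n->nextSibling !== null) {
                $n->parentNode->removeChild($n->nextSibling);
            }
        }
        $stop->parentNode->removeChild($stop);
    }

    $out = $doc->saveHTML($container);
    return $out === false ? null : $out;
}

echo html_before_marker($content, 'content', 'Ver_tamb.C3.A9m'), PHP_EOL;
```

If the marker is not found inside the container, the whole container is returned unchanged, which seems a safer default than returning nothing.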

Update

If you want a reusable way to get only the page title, you can create a function:

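/**
 * Gets the content of the <title> tag of a URL
 *
 * @param string $url
 * @return string|null
 */
function url_get_title($url) {

    $content = file_get_contents($url);

    // file_get_contents returns false when the request fails
    if ($content === false) {
        return null;
    }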

    $doc = new DOMDocument();

    @$doc->loadHTML($content);

    $titleTag = $doc->getElementsByTagName('title')->item(0);

    if ($titleTag) {
        return $titleTag->nodeValue;
    }

    return null;
}

Then, whenever you want the title of a page, just call:

url_get_title('http://www.google.com'); // string (Google)

Note: whenever you use file_get_contents to fetch the content of a URL, remember that you must always include the URL scheme (http or https). If you don’t, PHP will try to open the string as a path to a local file. Even when the request goes to your own domain, the scheme is still required.
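A quick way to catch a missing scheme before calling file_get_contents is to inspect the URL with parse_url. This is a minimal sketch; the helper name has_http_scheme is hypothetical.

```php
<?php
/**
 * Hypothetical helper: true when $url carries an http or https scheme.
 */
function has_http_scheme(string $url): bool
{
    // parse_url returns null when the component is absent and false on a malformed URL.
    $scheme = parse_url($url, PHP_URL_SCHEME);

    return is_string($scheme) && in_array(strtolower($scheme), ['http', 'https'], true);
}

var_dump(has_http_scheme('http://www.google.com')); // bool(true)
var_dump(has_http_scheme('www.google.com'));        // bool(false)
```

Without the scheme, 'www.google.com' is parsed as a plain path, which matches how file_get_contents would treat it.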

  • Friend @Wallace-maxters, if you could help with this code it would be of great value. I have been searching for a while for how to read the title of a site and save it in the database; I just need to know how to get the title and put it in a variable so I can save it. Your code is very clean and elegant, but I could not make it work. Could you simplify it and add an example when you have time? Very grateful.

  • The partial code I have is this: $domdocument = new DOMDocument(); @$domdocument->loadHTMLFile(urldecode($strri_array)); $domxpath = new DOMXPath($domdocument); $title_path = $domxpath->query("//title")->item(0)->nodeValue;

  • What’s the question, exactly? Do you want to get the title based on the code you’ve shown?

  • By the time I posted my comment you had already updated your answer. What is in your answer is exactly what I wanted, and it worked very well. Very grateful. Pushing my luck a little: do you know how to get toll information with the Google Maps API (Matrix)? lol

  • Apparently it doesn’t work if I try to fetch from a domain on the same server as the domain where I’m running the code, right? I have two domains pointing to the same host, and when I try to get the title of either of them I get the error Warning: file_get_contents(http://itpin.tk/): failed to open stream: Redirection limit reached, aborting in /home/u915144746/public_html/Plugins/entry.php on line 9. In that case, what would an if (title) { ok } else { no title } look like here?

  • @flourigh PHP treats URLs differently from files when you use file_get_contents. It’s not that it doesn’t work on the same server; it’s that you didn’t put http at the start.

  • Actually, it does include http: Warning: file_get_contents(http://itpin.tk/): failed to open stream: Redirection limit reached, aborting in /home/u915144746/public_html/Plugins/entry.php on line 9. Stack Overflow is stripping it out.

  • It is redirecting. Read the error message carefully and try to resolve it. If you don’t understand the problem, search the internet or ask a new question on the site. There is no limit on asking questions here, as long as they follow the site’s rules.

  • Got it. It is probably because I have code that checks which URL the index is being accessed from in order to open the correct program: one index serving many URLs. That must be it.

  • I really like your answer; it will help me in the future. But I still can’t get just the part I want, because I want only the content that runs from <div id="content" class="mw-body" role="main"> up to <span class="mw-headline" id="Ver_tamb.C3.A9m">Ver também</span>

  • I was trying it this way: preg_match_all("/<div id="content" class="mw-body" role="main">(.*?)<span class="mw-headline" id="Ver_tamb.C3.A9m">Ver também</span>/", $content, $matches)
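For reference, a few things can stop a pattern like the one in the last comment from matching. The double quotes inside a double-quoted PHP string end the string early. The / in </span> clashes with the / delimiter. And in PCRE the dot does not match newlines unless the s modifier is set, while the wanted block spans many lines. The sketch below addresses those three points on sample HTML, which is an assumption and not the real page. On very large input a lazy (.*?) may also hit pcre.backtrack_limit, in which case preg_match returns false.

```php
<?php
// Sample multi-line HTML standing in for the downloaded page (an assumption).
$content = <<<'HTML'
<div id="content" class="mw-body" role="main">
<p>First paragraph.</p>
<p>Second paragraph.</p>
<h2><span class="mw-headline" id="Ver_tamb.C3.A9m">Ver também</span></h2>
</div>
HTML;

// Single-quoted PHP string, so the double quotes of the HTML need no escaping;
// "#" as delimiter, so "/" needs no escaping; "s" lets "." match newlines.
$pattern = '#<div id="content" class="mw-body" role="main">(.*?)'
    . '<span class="mw-headline" id="Ver_tamb\.C3\.A9m">#s';

$result = preg_match($pattern, $content, $matches);

if ($result === 1) {
    // Captured text; it still ends with the opening <h2> that wraps the span.
    echo $matches[1], PHP_EOL;
} elseif ($result === false) {
    // For example PREG_BACKTRACK_LIMIT_ERROR on very large input.
    echo 'Regex error code: ', preg_last_error(), PHP_EOL;
} else {
    echo 'No match', PHP_EOL;
}
```

Even with these fixes the pattern depends on the exact attribute order and spacing of the page, which is why the DOMDocument approach from the answer tends to be more robust.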
