Scan page source code and find the Wikipedia URL

3

I'm having trouble with a regex :/

I was using this pattern to get Wikipedia URLs from the source code of Google search results: wikipedia\.org[^\" ]+

But the URLs are embedded like this: <a href="/url?q=http://en.wikipedia.org/wiki/World_Surfing_Games&amp;sa=U&amp;ei=yS6WVOvAA9HLsAShkYGoCw&amp;ved=0CBQQFjAB&amp;usg=AFQjCNFbV5WzVcG-aJbrvGdhbxz3wnPUKg">, so the pattern ends up giving me: http://pt.wikipedia.org/wiki/ASP_World_Tour&sa=U&ei=yS6WVOvAA9HLsAShkYGoCw&ved=0CBQQFjAB&usg=AFQjCNFbV5WzVcG-aJbrvGdhbxz3wnPUKg

However, this is not a valid Wikipedia URL; the correct one would be just http://en.wikipedia.org/wiki/World_Surfing_Games

2 answers

5

Since the URL does not contain a ? (the character that marks the start of a query string), a simple approach is to use a regular expression that removes everything from the first & onward:

$url = 'http://pt.wikipedia.org/wiki/ASP_World_Tour&sa=U&ei=yS6WVOvAA9HLsAShkYGoCw&ved=0CBQQFjAB&usg=AFQjCNFbV5WzVcG-aJbrvGdhbxz3wnPUKg';

$url = preg_replace('/\&.*/', '', $url);

echo $url; // Output: http://pt.wikipedia.org/wiki/ASP_World_Tour

See the example on Ideone.
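
If you are working with the raw href from the Google page (the /url?q=...&amp;sa=... form shown in the question) rather than an already trimmed string, another option is to read the q parameter instead of cutting at the first &. This is only a rough sketch: the function name extractTargetUrl is hypothetical, and it assumes the target always arrives in a parameter named q, as in the question's example.

```php
<?php
// Hypothetical helper (not part of the original answer): unwraps a Google
// redirect href of the form /url?q=<target>&sa=... and returns <target>.
function extractTargetUrl(string $href): ?string
{
    // Hrefs copied from raw HTML still carry &amp; entities; decode them first.
    $href = html_entity_decode($href, ENT_QUOTES | ENT_HTML5);

    // Everything after the first ? is the query string.
    $parts = explode('?', $href, 2);
    if (count($parts) < 2) {
        return null; // no query string, nothing to unwrap
    }

    // parse_str() splits the query into an array and URL-decodes the values.
    parse_str($parts[1], $params);

    return (isset($params['q']) && is_string($params['q'])) ? $params['q'] : null;
}

$href = '/url?q=http://en.wikipedia.org/wiki/World_Surfing_Games&amp;sa=U&amp;ved=0CBQQFjAB';
echo extractTargetUrl($href), "\n";
```

Compared with cutting at the first &, this also copes with a target URL that arrives percent-encoded, because parse_str() decodes it.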
  • Hey, which regex could I use to find a Wikipedia URL? I was trying this one: (.*).wikipedia.org\/wiki\/(.*)[ ]+

  • @user3163662 That depends on what you are doing; a regex may not even be the best approach. Open a new question with a code example and it will be easier to help.
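
For reference, if a regex does turn out to be the right tool, one possible starting point is sketched here. The pattern is an assumption chosen for the URLs shown in this thread (a simple language subdomain followed by /wiki/); it stops at the first &, quote, whitespace, or angle bracket, and the sample $html string is made up for the demonstration.

```php
<?php
// Illustrative pattern (an assumption, not a definitive Wikipedia matcher):
// scheme, language subdomain, /wiki/, then the title up to the first
// character that cannot belong to it in this context.
$pattern = '~https?://[a-z-]+\.wikipedia\.org/wiki/[^&"\'\s<>]+~i';

// Made-up sample input shaped like the links in the question.
$html = '<a href="/url?q=http://en.wikipedia.org/wiki/World_Surfing_Games&amp;sa=U">World Surfing Games</a> '
      . '<a href="/url?q=http://pt.wikipedia.org/wiki/ASP_World_Tour&amp;sa=U">ASP World Tour</a>';

// Collect every match and drop duplicates.
preg_match_all($pattern, $html, $matches);
$urls = array_values(array_unique($matches[0]));

print_r($urls);
```

Unlike the (.*) attempt, the negated character class cannot run past the end of the title, so the tracking parameters are never captured.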

2

Apparently you want to get the Wikipedia link from a Google search, correct?

Well, I just wrote and tested a solution for you. It may not be the best one, but it works! :D

<?php

    $TermoDeBusca = urlencode('ASP World Tour'); // Search term

    // cURL: fetch the Google results page
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_URL, 'https://www.google.com/search?q='.$TermoDeBusca);
    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36');
    $html = curl_exec($ch);

    // DOM: parse the returned HTML
    libxml_use_internal_errors(true); // real-world HTML is rarely valid; suppress parser warnings
    $dom = new DOMDocument;
    $dom->loadHTML($html);

    $xpath = new DOMXPath($dom);
    $items = $xpath->query("//h3[contains(@class, 'r')]//a"); // Selects the <a> inside each <h3> of class 'r'

    foreach ($items as $pega) { // Loop over each link

        $link = $pega->getAttribute('href'); // e.g. http://pt.wikipedia.org/wiki/ASP_World_Tour

        // Checks whether $link contains 'wikipedia.org', i.e. whether it is a Wikipedia link (a quick hack).
        // strpos() returns the match offset, so compare strictly against false.
        if (strpos($link, 'wikipedia.org') !== false) {
            echo $link.'<br>'; // if so, print the link
        } // end if

    } // end foreach
?>
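
A note on strpos-based checks like the one in the loop: strpos() returns the offset of the match, which is 0 when the text starts with the needle, and 0 is falsy in a loose comparison. The sketch below illustrates that edge case; the helper name isWikipediaLink is hypothetical and the sample strings are made up.

```php
<?php
// Hypothetical helper: true when the link mentions wikipedia.org anywhere.
function isWikipediaLink(string $link): bool
{
    // Strict comparison: only a literal false means "not found".
    return strpos($link, 'wikipedia.org') !== false;
}

$atStart = 'wikipedia.org/wiki/ASP_World_Tour'; // needle at offset 0

var_dump(strpos($atStart, 'wikipedia.org'));          // int(0)
var_dump(strpos($atStart, 'wikipedia.org') == true);  // bool(false), because 0 is falsy
var_dump(isWikipediaLink($atStart));                  // bool(true)
var_dump(isWikipediaLink('http://example.com/'));     // bool(false)
```

With hrefs that always start with http:// or /url?q= the offset is never 0, so the loose check happens to work there, but the strict form is safe in both cases.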

I tried to comment as much as I could; unfortunately I don't have time for more. I did what I could! :D
