Simple HTML DOM: how to grab the link inside a "text/javascript" script tag?

-1

For example, the URL inside this tag:

<script type="text/javascript">
    var src = "https:www.site.com";
</script>

I have searched, but I could not adapt the examples I found to what I need.

The code goes like this:

include('simple_html_dom.php');
$page = 'http://www.site.com'; // load_file() needs the scheme for a remote page
$html = new simple_html_dom();
$html->load_file($page);

$links = array();
foreach ($html->find('script') as $element) { // the selector must be a quoted string
    $links[] = $element;
    echo $element;
}

reset($links);

What I want is to get the link inside this tag:

<script type="text/javascript">
  var src = "https:www.site.com";
</script>

The result should be only this: https:www.site.com

  • Please explain more clearly what you are trying to do.

  • Explain in more detail, so that we can understand your problem.

2 answers

1

You can use PHP's native DOMDocument API, combined with cURL or file_get_contents, and then use preg_match. A simple example to illustrate:

<?php
$meuhtml = '
<script type="text/javascript">
    var src = "https:www.site.com";
</script>
<script type="text/javascript">
    var    src    = \'https:www.site2.com\';
</script>
';

$doc = new DOMDocument;
$doc->loadHTML($meuhtml);

$tags = $doc->getElementsByTagName('script');

$urls = array();

foreach ($tags as $tag) {
    if (preg_match('#var\s+src(\s+|)=(\s+|)(".*";|\'.*\';)#', $tag->nodeValue, $match)) {
        $result = preg_replace('#^["\']|["\'];$#', '', $match[3]);
        $urls[] = $result; // Add to the array
    }
}

// Show all URLs
print_r($urls);

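The regular expression #var\s+src(\s+|)=(\s+|)(".*";|\'.*\';)# is what extracts the URL from the text returned by $tag->nodeValue: (\s+|) allows optional whitespace on either side of the =, and the third group matches a double-quoted or single-quoted string followed by ;. The preg_replace call then strips the quotes and the trailing semicolon. See it working at https://repl.it/Hwt4 (click the Run button once the page loads).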
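One limitation worth noting: .* is greedy, so if a second quoted string follows on the same line, the capture runs on to the last closing quote. A minimal sketch of an alternative pattern, which is an assumption of mine and not the one used in this answer, captures the opening quote and requires the same character to close the string:

```php
<?php
// Hypothetical pattern, not the one used in the answer above:
//   (["\'])  captures the opening quote (double or single)
//   (.*?)    lazily captures the URL, stopping at the first possible close
//   \1       requires the same quote character that opened the string
$pattern = '#var\s+src\s*=\s*(["\'])(.*?)\1\s*;#';

// Sample input with a second quoted string on the same line.
$js = 'var src = "https:www.site.com"; var other = "x";';

if (preg_match($pattern, $js, $match)) {
    // $match[1] is the quote character, $match[2] is the URL itself
    echo $match[2], PHP_EOL;
}
```

With this pattern the URL is already in $match[2], so the extra preg_replace step to strip quotes is not needed.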

Of course, this was only an example to explain the code. To download the data from another site you can use cURL, or file_get_contents if allow_url_fopen is set to On in your php.ini. An example with cURL:

<?php
$url = 'http://site.com';

$ch = curl_init($url);
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)');
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
$data = curl_exec($ch);

if (!$data) {
    die('Error');
}

$httpcode = curl_getinfo($ch, CURLINFO_HTTP_CODE);

if ($httpcode !== 200) {
    die('Request error');
}

curl_close($ch);

$doc = new DOMDocument;
$doc->loadHTML($data);

$tags = $doc->getElementsByTagName('script');

$urls = array();

foreach ($tags as $tag) {
    if (preg_match('#var\s+src(\s+|)=(\s+|)(".*";|\'.*\';)#', $tag->nodeValue, $match)) {
        $result = preg_replace('#^["\']|["\'];$#', '', $match[3]);
        $urls[] = $result; // Add to the array
    }
}

// Show all URLs
print_r($urls);

Or, if you only want the first URL, change the loop to:

$url = '';

foreach ($tags as $tag) {
    if (preg_match('#var\s+src(\s+|)=(\s+|)(".*";|\'.*\';)#', $tag->nodeValue, $match)) {
        $result = preg_replace('#^["\']|["\'];$#', '', $match[3]);
        $url = $result;

        break; // Stop the foreach as soon as a URL is found
    }
}

echo $url;
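The file_get_contents route mentioned earlier is not shown in code, so here is a minimal sketch of it. The function name fetchHtml, the user agent string, and the timeout are my own assumptions for illustration; remote URLs only work when allow_url_fopen is On.

```php
<?php
// Hypothetical helper: download a page with file_get_contents instead of cURL.
function fetchHtml(string $url): string
{
    // Options for the http(s) stream wrapper; the values are illustrative.
    $context = stream_context_create([
        'http' => [
            'user_agent'      => 'Mozilla/5.0 (compatible; ExampleBot/1.0)',
            'follow_location' => 1,   // follow redirects, like CURLOPT_FOLLOWLOCATION
            'timeout'         => 10,  // seconds
        ],
    ]);

    // @ silences the warning; failure is reported through the exception below.
    $data = @file_get_contents($url, false, $context);

    if ($data === false) {
        throw new RuntimeException("Could not download $url");
    }

    return $data;
}

// Usage (needs network access and allow_url_fopen=On):
// $data = fetchHtml('http://site.com');
// $doc = new DOMDocument;
// $doc->loadHTML($data);
```

The returned string can be passed to $doc->loadHTML() exactly like the $data obtained from curl_exec.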

0

Just use PHP's XPath support, basically like this:

$html = "your HTML obtained with file_get_contents or cURL...";

$DOM = new DOMDocument;
$DOM->loadHTML($html);

$XPath = new DomXPath($DOM);

$TagScriptJavascript = $XPath->query('//script[@type="text/javascript"]');

foreach($TagScriptJavascript as $item){

    if(preg_match('/var src = "(.*)";/', $item->nodeValue, $url)){

        echo $url[1];

    }

}

Explanations:

  1. First, load your HTML into the DOM, however it was obtained.

  2. $TagScriptJavascript holds all script elements whose type attribute has the value text/javascript, as specified by the query (//script[@type="text/javascript"]).

  3. The foreach performs step 4 for each element in $TagScriptJavascript.

  4. preg_match looks for var src = "(anything)"; and, if it finds a match, the URL is printed by echo $url[1].


Test it out here.
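The same steps can also be wrapped in a function. This is only a sketch under my own assumptions: the name extractSrcUrls is hypothetical, the query also accepts script tags without a type attribute, and the regex uses \s* so that extra or missing spaces around the = do not break the match.

```php
<?php
// Hypothetical helper (not from the answer): the steps above as a function.
function extractSrcUrls(string $html): array
{
    $dom = new DOMDocument;

    // Real pages are rarely valid HTML; keep parser warnings out of the output.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);

    // Assumption: a <script> with no type attribute is treated as JavaScript too.
    $scripts = $xpath->query('//script[not(@type) or @type="text/javascript"]');

    $urls = [];
    foreach ($scripts as $script) {
        // \s* tolerates any amount of whitespace around "=" and before ";".
        // Like the answer's regex, this only handles double-quoted values.
        if (preg_match('/var\s+src\s*=\s*"([^"]*)"\s*;/', $script->nodeValue, $m)) {
            $urls[] = $m[1];
        }
    }

    return $urls;
}

$html = '<html><body>
<script type="text/javascript">var src = "https:www.site.com";</script>
<script>var   src="https:www.site2.com" ;</script>
<script type="text/template">var src = "ignored";</script>
</body></html>';

print_r(extractSrcUrls($html));
```

Returning an array instead of echoing inside the loop leaves it to the caller to decide whether to use every URL or only the first.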

  • type=text/javascript? Some people use it and some do not :). Personally, I find DomXPath unnecessary in this specific case; in other cases it is a great option and really makes life easier.

  • I used it mostly because of the title, grab a "text/javascript" link; it stuck in my head, not that it is really necessary. I think the biggest problem is the regex, which is simplistic: any extra space breaks it, which your answer does account for.
