Simple_html_dom what is the difference between the two Urls?

Question


Asked 6 years, 11 months ago


URL2 works and I can extract the data; URL1 does not.

<?php 

include "simple_html_dom.php";
$CARDGALGO = file_get_html("$URLX");

echo $CARDGALGO;

?>

1 answer


by Guilherme Nascimento • **98,651** points · Answer 1 · 2018-09-07T17:29:47+00:00

I debugged the script and noticed that the page at URL1 exceeds the MAX_FILE_SIZE limit, which is currently 600000; see line 66 of simple_html_dom.php:

 define('MAX_FILE_SIZE', 600000);
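
One way to confirm this diagnosis is to compare the length of the downloaded page with the limit. The sketch below assumes the library compares the raw string length against MAX_FILE_SIZE; the helper name exceedsSizeLimit is hypothetical and not part of simple_html_dom:

```php
<?php

// Hypothetical helper: reports whether a downloaded page is larger than
// the limit quoted above (600000 by default).
function exceedsSizeLimit(string $html, int $limit = 600000): bool
{
    return strlen($html) > $limit;
}

// Offline demonstration with generated strings; with a real page you would
// pass the result of file_get_contents($url) instead.
$small = str_repeat('a', 1000);
$large = str_repeat('a', 600001);

var_dump(exceedsSizeLimit($small)); // bool(false)
var_dump(exceedsSizeLimit($large)); // bool(true)
```

If the check returns true for URL1 and false for URL2, the size limit is the likely difference between the two pages.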

You can either increase this limit or drop the extra library and use PHP's native DOM API:

http://php.net/manual/en/domdocument.loadhtmlfile.php

Example:

<?php

$URL1 = "http://ladbrokes.365dm.com/greyhounds/profile/dog/oor-millie/3334094";

$doc = new DOMDocument;
$doc->loadHTMLFile($URL1);

To select a specific element you can use:

Select by ID: http://php.net/manual/en/domdocument.getelementbyid.php
Select all elements of a given tag name: http://php.net/manual/en/domdocument.getelementsbytagname.php

Grabbing the text of a specific element by ID:

<?php

$URL1 = "http://ladbrokes.365dm.com/greyhounds/profile/dog/oor-millie/3334094";

$doc = new DOMDocument;
$doc->loadHTMLFile($URL1);

echo 'Text:', $doc->getElementById('logo')->textContent, '<br>';

This example reads the following part of the page:

<header id="header" role="banner">
    <div class="hix">
        <a href="greyhounds" id="logo">Ladbrokes</a>
        <div id="nav-mobile-open"></div>
    </div>
</header>
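
Note that getElementById returns null when no element has the requested ID, so reading textContent directly can fail on pages that lack it. A minimal sketch of a guarded lookup, using an inline copy of the header fragment so it runs without network access (the ID does-not-exist is made up for the demonstration):

```php
<?php

// Inline copy of the header fragment quoted above.
$html = '<header id="header" role="banner">'
      . '<div class="hix"><a href="greyhounds" id="logo">Ladbrokes</a>'
      . '<div id="nav-mobile-open"></div></div></header>';

// Silence parser warnings while loading, then restore the previous setting.
$previous = libxml_use_internal_errors(true);
$doc = new DOMDocument;
$doc->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors($previous);

// Guard against a missing element before reading its text.
$logo = $doc->getElementById('logo');
$logoText = $logo !== null ? trim($logo->textContent) : null;

// An ID that is not in the document yields null.
$missing = $doc->getElementById('does-not-exist');

echo 'Text: ', $logoText ?? '(not found)', "\n";
```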

To select all elements of a given type, such as all links, you could use something like:

<?php

$URL1 = "http://ladbrokes.365dm.com/greyhounds/profile/dog/oor-millie/3334094";

$doc = new DOMDocument;
$doc->loadHTMLFile($URL1);

foreach ($doc->getElementsByTagName('a') as $node) {
    echo 'Text:', $node->textContent, '<br>';
}
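
If the link targets are needed as well as the text, getAttribute can be read on each node. This is a sketch against a made-up fragment (the second link and its href are assumptions for illustration, not taken from the real page):

```php
<?php

// Made-up fragment for illustration; the real page has many more links.
$html = '<div><a href="greyhounds" id="logo">Ladbrokes</a>'
      . '<a href="/results">Results</a></div>';

$doc = new DOMDocument;
$doc->loadHTML($html);

$links = [];
foreach ($doc->getElementsByTagName('a') as $node) {
    // getAttribute() returns an empty string when the attribute is absent.
    $links[] = [
        'text' => trim($node->textContent),
        'href' => $node->getAttribute('href'),
    ];
}

foreach ($links as $link) {
    echo $link['text'], ' -> ', $link['href'], "\n";
}
```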

Using DOMXPath

The most practical way to select specific elements, though, is XPath. On this page the fourth column of each table row holds the trainer's name, so the XPath expression would be something like:

//tr/td[4]

Example:

<?php

$URL1 = "http://ladbrokes.365dm.com/greyhounds/profile/dog/oor-millie/3334094";

$doc = new DOMDocument;
$doc->loadHTMLFile($URL1);

$xpath = new DOMXPath($doc);

$colunas = $xpath->query("//tr/td[4]");

echo 'Trainers:<br>';

foreach ($colunas as $node) {
    $nome = trim($node->textContent);
    echo ' - ', $nome, '<br>';
}
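
To try the expression without fetching the page, the same query can be run against an inline table. The table below is made up (the column layout and names are assumptions); it only illustrates that //tr/td[4] picks the fourth td of each row and skips a header row built from th cells:

```php
<?php

// Made-up table; the names and column layout are assumptions for illustration.
$html = '<table>'
      . '<tr><th>Date</th><th>Track</th><th>Dist</th><th>Trainer</th></tr>'
      . '<tr><td>01Sep</td><td>Track A</td><td>480</td><td> Trainer One </td></tr>'
      . '<tr><td>05Sep</td><td>Track B</td><td>500</td><td>Trainer Two</td></tr>'
      . '</table>';

$doc = new DOMDocument;
$doc->loadHTML($html);

$xpath = new DOMXPath($doc);

// Collect the trimmed text of the fourth cell in every row.
$names = [];
foreach ($xpath->query('//tr/td[4]') as $node) {
    $names[] = trim($node->textContent);
}

print_r($names);
```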

Avoiding warnings caused by HTML errors on a page

The pages you linked contain many HTML errors, which can trigger many warnings. To keep them from being displayed, enable libxml's internal error handling before loading the page, then clear the collected errors and restore the previous setting afterwards:

<?php

$URL1 = "http://ladbrokes.365dm.com/greyhounds/profile/dog/oor-millie/3334094";

$doc = new DOMDocument;

$estadoOriginal = libxml_use_internal_errors(true);

$doc->loadHTMLFile($URL1);

libxml_clear_errors();

libxml_use_internal_errors($estadoOriginal);
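
If you would rather inspect the suppressed problems than discard them, libxml_get_errors can be called before clearing. A minimal sketch, where the helper name loadHtmlQuietly and the malformed sample markup are hypothetical:

```php
<?php

// Hypothetical helper: parses HTML without emitting warnings and returns
// the document together with whatever libxml recorded while parsing.
function loadHtmlQuietly(string $html): array
{
    $doc = new DOMDocument;

    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    $errors = libxml_get_errors(); // array of LibXMLError objects, possibly empty
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return [$doc, $errors];
}

// Deliberately malformed sample markup.
[$doc, $errors] = loadHtmlQuietly('<p>Unclosed <b>markup</p></i>');

foreach ($errors as $error) {
    echo 'Line ', $error->line, ': ', trim($error->message), "\n";
}
```

Which messages appear, if any, depends on the libxml version installed, so the output is not shown here.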