Get data from another site by class name


Viewed 1,708 times

0

How can I search for data present in another site’s HTML?

The relevant part of the other site's HTML is this:

<p class=MsoNormal align=center style='text-align:center'>
  <span style='font-size:10.0pt;font-family:"Arial","sans-serif";mso-fareast-font-family: "Times New Roman";color:black'>COTAÇÕES</span>
  <span style='mso-fareast-font-family: "Times New Roman"'><o:p></o:p></span>
</p>

In my own site's code I tried this:

 $html = new DOMDocument();
 $html->loadHTMLFile('http://www.agropan.coop.br/cotac.htm');

 echo $html->getElementByClassName('MsoNormal').getAttribute("p");

What I actually want is to fetch only the content "COTAÇÕES", located by its class name. How should I do that?

  • Hi, good evening. Can you explain what you are trying to do? You have a website and need only what is inside a div that has class "x", is that it?

  • Exactly, friend. I need to fetch the content of a certain tag on another site by its class name. I posted the code above as an example.

  • Got it, I've done this before. You'll use cURL. I'm not at a computer right now; if it hasn't been solved by later, I'll post an example here!

  • I edited it to avoid the errors. See if it works that way.

  • The DOM is more complete and more complicated than you need; in your case it is like killing an ant with a cannonball. Try SimpleXMLElement.

2 answers

1


    <meta http-equiv="content-type" content="text/html; charset=utf-8" />
    <?php
    error_reporting(E_ALL & ~E_NOTICE);
    $html = file_get_contents("http://www.agropan.coop.br/cotac.htm");

    $DOM = new DOMDocument();
    // Buffer libxml parse errors instead of emitting them as warnings
    libxml_use_internal_errors(true);
    $DOM->loadHTML($html);
    libxml_clear_errors();
    $finder = new DOMXPath($DOM);
    $classname = 'MsoNormal';
    // Select every element whose class attribute contains the class name
    $nodes = $finder->query("//*[contains(@class, '$classname')]");
    $result = ''; // initialize before appending to avoid an undefined-variable error
    foreach ($nodes as $node) {
      $result .= $node->nodeValue . "***";
    }

    // Remove tabs and collapse whitespace runs and newlines into single spaces
    $result = preg_replace(array("/\t/", "/\s{2,}/", "/\n/"), array("", " ", " "), $result);
    // Keep only the text of the first matching node
    $partes = explode('***', $result);
    $cotacoes = $partes[0];
    $cotacoes = trim(preg_replace('/[\r\n]+/', '', $cotacoes));
    $cotacoes = str_replace("COTAÇÕES", "", $cotacoes);

    echo $cotacoes;

    ?>
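One caveat about the query above: `contains(@class, ...)` is a substring test, so it also matches longer class names that merely start with `MsoNormal`. A minimal sketch of a whole-token match follows, assuming an inline sample in place of the remote page (the second paragraph and its class `MsoNormalIndent` are made up for illustration):

```php
<?php
// Inline sample standing in for the remote page (an assumption, not the real markup).
// Entities are used for the accented letters so the sample does not depend on file encoding.
$html = <<<HTML
<html><body>
<p class="MsoNormal" align="center"><span>COTA&Ccedil;&Otilde;ES</span></p>
<p class="MsoNormalIndent">other text</p>
</body></html>
HTML;

$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$classname = 'MsoNormal';

// Pad the normalized class list with spaces and look for " MsoNormal ",
// so only a whole class token matches.
$query = "//*[contains(concat(' ', normalize-space(@class), ' '), ' $classname ')]";

$texts = [];
foreach ($xpath->query($query) as $node) {
    $texts[] = trim($node->textContent);
}

print_r($texts);
```

With this query the `MsoNormalIndent` paragraph is skipped, whereas the plain `contains(@class, 'MsoNormal')` form would return both paragraphs.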

Other ways to avoid the errors caused by invalid tags, such as "Tag o:p invalid in Entity":

1: Remove the invalid tags before parsing:

    <meta http-equiv="content-type" content="text/html; charset=utf-8" />
    <?php
    error_reporting(E_ALL & ~E_NOTICE);
    $html = file_get_contents("http://www.agropan.coop.br/cotac.htm");

    // Strip the Office tags that the HTML parser rejects
    $search = array("<o:p>", "</o:p>");
    $replace = array("", "");
    $html = str_replace($search, $replace, $html);

    $DOM = new DOMDocument();
    $DOM->loadHTML($html);
    $finder = new DOMXPath($DOM);
    $classname = 'MsoNormal';
    $nodes = $finder->query("//*[contains(@class, '$classname')]");
    $result = ''; // initialize before appending to avoid an undefined-variable error
    foreach ($nodes as $node) {
      $result .= $node->nodeValue . "***";
    }

    // Remove tabs and collapse whitespace runs and newlines into single spaces
    $result = preg_replace(array("/\t/", "/\s{2,}/", "/\n/"), array("", " ", " "), $result);
    $partes = explode('***', $result);
    $cotacoes = $partes[0];
    $cotacoes = trim(preg_replace('/[\r\n]+/', '', $cotacoes));
    $cotacoes = str_replace("COTAÇÕES", "", $cotacoes);

    echo $cotacoes;

    ?>
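The `str_replace` call above only catches the exact strings `<o:p>` and `</o:p>`. If the page also contains variants with attributes or other tags in the `o:` namespace, a regular expression may be more forgiving. This is only a sketch; the helper name `stripOfficeTags` is hypothetical:

```php
<?php
// Hypothetical helper: remove any tag in the "o:" namespace, for example
// <o:p>, </o:p>, <o:p/> or <o:p class="x">, while keeping the text between them.
function stripOfficeTags(string $html): string
{
    // preg_replace returns null on a regex error; fall back to the input in that case.
    return preg_replace('~</?o:[a-z0-9]+\b[^>]*>~i', '', $html) ?? $html;
}

$sample = "<p class=MsoNormal><span>COTAÇÕES</span><span><o:p></o:p></span></p>";
echo stripOfficeTags($sample), PHP_EOL;
```

The cleaned string can then be passed to `$DOM->loadHTML()` exactly as in the code above.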

2: Suppress the warnings with the @ operator on the call: @$DOM->loadHTML($html);

    <meta http-equiv="content-type" content="text/html; charset=utf-8" />
    <?php
    error_reporting(E_ALL & ~E_NOTICE);
    $html = file_get_contents("http://www.agropan.coop.br/cotac.htm");

    $DOM = new DOMDocument();
    // The @ operator silences the parser warnings for this call only
    @$DOM->loadHTML($html);
    $finder = new DOMXPath($DOM);
    $classname = 'MsoNormal';
    $nodes = $finder->query("//*[contains(@class, '$classname')]");
    $result = ''; // initialize before appending to avoid an undefined-variable error
    foreach ($nodes as $node) {
      $result .= $node->nodeValue . "***";
    }

    // Remove tabs and collapse whitespace runs and newlines into single spaces
    $result = preg_replace(array("/\t/", "/\s{2,}/", "/\n/"), array("", " ", " "), $result);
    $partes = explode('***', $result);
    $cotacoes = $partes[0];
    $cotacoes = trim(preg_replace('/[\r\n]+/', '', $cotacoes));
    $cotacoes = str_replace("COTAÇÕES", "", $cotacoes);

    echo $cotacoes;

    ?>
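The `@` operator hides the warnings completely. If you would rather see what the parser complained about without the warnings reaching the output, the libxml error buffer can be read before it is cleared. A minimal sketch, assuming a small inline sample instead of the remote page:

```php
<?php
// Inline sample (an assumption) containing the Office tag that triggers the complaint.
$html = '<p class=MsoNormal><span>text</span><span><o:p></o:p></span></p>';

// Buffer parser errors instead of emitting warnings; remember the previous setting.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($html);

// Each buffered entry is a LibXMLError object with message, line, level and code.
$messages = [];
foreach (libxml_get_errors() as $error) {
    $messages[] = sprintf('line %d: %s', $error->line, trim($error->message));
}

libxml_clear_errors();                 // empty the buffer
libxml_use_internal_errors($previous); // restore the earlier setting

print_r($messages);
```

Which messages appear, if any, depends on the libxml2 version PHP was built against.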

To search by id instead, just replace

    $classname = 'MsoNormal';
    $nodes = $finder->query("//*[contains(@class, '$classname')]");

with

    $id = 'MsoNormal';
    $nodes = $finder->query("//*[contains(@id, '$id')]");
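Since an id is normally unique, an exact comparison may be a better fit than `contains()`, which is a substring test. A minimal sketch showing the difference, assuming made-up markup (the id values `cotacoes` and `cotacoes-old` are hypothetical):

```php
<?php
// Made-up sample markup; the id values are hypothetical.
$html = '<div id="cotacoes"><p>first</p></div>'
      . '<div id="cotacoes-old"><p>second</p></div>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$id = 'cotacoes';

// Substring test: matches both "cotacoes" and "cotacoes-old".
$loose = $xpath->query("//*[contains(@id, '$id')]");

// Equality test: matches only the element whose id is exactly "cotacoes".
$exact = $xpath->query("//*[@id='$id']");

echo $loose->length, ' ', $exact->length, PHP_EOL; // prints "2 1"
```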

  • Thank you for helping. I tried it that way, but several errors appeared, such as "Notice: DOMDocument::loadHTML(): Namespace prefix o is not defined in Entity, line: 462", "Warning: DOMDocument::loadHTML(): Tag o:p invalid in Entity" and "Warning: DOMXPath::query(): Invalid Expression"

  • If you look at the source code of the www.agropan.coop.br/cotac.htm page you will understand these errors. It has a p tag with class=MsoNormal whose content is <o:p></o:p>. The error is that o:p is not a valid HTML tag. To avoid this I added libxml_use_internal_errors(true); and libxml_clear_errors();

  • Beyond that, the only option is to ask the page owner to stop using invalid tags.

  • Sensational, friend, thank you so much for your help. Now I will study this code in more depth. Big hug.

  • Leo, what would change if I searched by id instead of by class?

  • @Wladermuriloalexandro $nodes = $finder->query("//*[contains(@id, '$classname')]"); To make it read well, rename $classname to $id.

  • $id = 'MsoNormal'; $nodes = $finder->query("//*[contains(@id, '$id')]");


0

In PHP you may need to change php.ini (the allow_url_fopen setting) to allow file_get_contents to read URLs. file_get_contents is meant for reading files and only incidentally handles URLs, so I believe it is better to use cURL, which was designed for this purpose and has broader support. But feel free to use file_get_contents.

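function file_get_contents_curl($url) {
    $ch = curl_init();

    curl_setopt($ch, CURLOPT_AUTOREFERER, true);    // set the Referer header when following redirects
    curl_setopt($ch, CURLOPT_HEADER, false);        // leave response headers out of the result
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow HTTP redirects

    $data = curl_exec($ch); // the response body, or false on failure
    curl_close($ch);

    return $data;
}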
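The php.ini setting that controls whether file functions may open URLs is `allow_url_fopen`, and it can be checked at run time. The sketch below prefers cURL when the extension is loaded and falls back to `file_get_contents` only when the setting allows it. The helper names `urlFopenEnabled` and `fetchUrl` are hypothetical, not part of PHP:

```php
<?php
// Hypothetical helper: true when php.ini lets file functions open URLs.
function urlFopenEnabled(): bool
{
    // ini_get returns a string such as "1", "On" or ""; normalize it to a boolean.
    return filter_var(ini_get('allow_url_fopen'), FILTER_VALIDATE_BOOLEAN);
}

// Hypothetical helper: fetch a URL and return its body, or null on failure.
function fetchUrl(string $url): ?string
{
    if (function_exists('curl_init')) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body as a string
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow HTTP redirects
        $data = curl_exec($ch);
        return $data === false ? null : $data;
    }

    if (urlFopenEnabled()) {
        $data = @file_get_contents($url);
        return $data === false ? null : $data;
    }

    return null; // neither mechanism is available
}

echo urlFopenEnabled() ? "allow_url_fopen is on\n" : "allow_url_fopen is off\n";
```

A call such as `$html = fetchUrl('http://www.agropan.coop.br/cotac.htm');` would then return the page source, or null if the request failed.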

Fetch the source of the address you want and, with a regular expression, capture the node with the class you need. I believe this is the best way to do it!

See below for an example expression for this...

^<[a-z]\s[a-z]+\=MsoNormal\salign=center\sstyle=\'text-align\:center\'\>\s+(.*)\s

Maybe this will solve it!
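To show how such an expression could be applied, here is a minimal sketch using `preg_match` on the snippet from the question. The pattern is a simplified variant written for this example, not the expression above, and regular expressions remain fragile against markup changes:

```php
<?php
// Sample taken from the question's snippet, shortened for readability.
$html = <<<HTML
<p class=MsoNormal align=center style='text-align:center'>
  <span style='font-size:10.0pt'>COTAÇÕES</span>
</p>
HTML;

// Capture everything between <p class=MsoNormal ...> and the next </p>.
// i = case-insensitive, s = let "." match newlines.
$pattern = '~<p\s+class=["\']?MsoNormal["\']?[^>]*>(.*?)</p>~is';

$text = null;
if (preg_match($pattern, $html, $m)) {
    // Drop the inner tags and surrounding whitespace, keeping only the text.
    $text = trim(strip_tags($m[1]));
}

echo $text, PHP_EOL;
```

In practice `$html` would come from `file_get_contents_curl($url)` rather than from an inline string.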

  • @Andersoncarloswoss Well, in PHP you may need to change php.ini to allow file_get_contents to read URLs. file_get_contents is meant for reading files and only incidentally handles URLs, so I believe it is best to use cURL.

  • @Andersoncarloswoss Done!

  • Thanks, Marcus, it worked here, buddy.
