How to extract parts of a text in PHP

How can I extract the following parts from the text below?

"name":"ASUS p/ Intel LGA 1151 ATX ROG STRIX Z270E GAMING,DDR4,Aura Sync, Audio Gamer, Intel Network, SLI/CFX, Wi-Fi, USB 3.1 Front,HDMI/DP"

and

"price":1095.9

Keep in mind that the name and price will differ depending on the link given; however, the text will always contain name .* and price .*

$texto = "
string(43488) "HTTP/1.1 200 OK
Etag: "a6152a2c"
Content-Type: text/html; charset=ISO-8859-1
Content-Length: 188487
X-TIME: 1493043126.194
X-XSS-Protection: 1; mode=block
X-Frame-Options: SAMEORIGIN
X-Content-Type-Options: nosniff
Access-Control-Allow-Origin: *
Cache-Control: max-age=219, public
Expires: Mon, 24 Apr 2017 14:17:06 GMT
Date: Mon, 24 Apr 2017 14:13:27 GMT
Set-Cookie: incap_ses_297_582873=HDYgfiL0VibiGqTihigfBAcI/lgAAAAAOgjiY0SVKeRwJpG/EqcKgg==; path=/; Domain=.kabum.com.br
Set-Cookie: ___utmvmPwutOXo=yrlhYeFzkwB; path=/; Max-Age=900
Set-Cookie: ___utmvaPwutOXo=XmSBlTw; path=/; Max-Age=900
Set-Cookie: ___utmvbPwutOXo=pZE
    XNfOValo: vtJ; path=/; Max-Age=900
X-Iinfo: 5-61671947-0 0CNN RT(1493043207095 0) q(0 -1 -1 -1) r(0 -1)
X-CDN: Incapsula

      window.lpTag=window.lpTag||{};if(typeof window.lpTag._tagCount==='undefined'){window.lpTag={site:'85687252'||'',section:lpTag.section||'',autoStart:lpTag.autoStart===false?false:true,ovr:lpTag.ovr||{},_v:'1.6.0',_tagCount:1,protocol:'https:',events:{bind:function(app,ev,fn){lpTag.defer(function(){lpTag.events.bind(app,ev,fn);},0);},trigger:function(app,ev,json){lpTag.defer(function(){lpTag.events.trigger(app,ev,json);},1);}},defer:function(fn,fnType){if(fnType==0){this._defB=this._defB||[];this._defB.push(fn);}else if(fnType==1){this._defT=this._defT||[];this._defT.push(fn);}else{this._defL=this._defL||[];this._defL.push(fn);}},load:function(src,chr,id){var t=this;setTimeout(function(){t._load(src,chr,id);},0);},_load:function(src,chr,id){var url=src;if(!src){url=this.protocol+'//'+((this.ovr&&this.ovr.domain)?this.ovr.domain:'lptag.liveperson.net')+'/tag/tag.js?site='+this.site;}var s=document.createElement('script');s.setAttribute('charset',chr?chr:'UTF-8');if(id){s.setAttribute('id
      $(document).ready(function() {
        $('#carousel').flexslider({
              animation: 'slide',
              animationSpeed: 300,
              slideshowSpeed: 4000,
              controlNav: false,
              animationLoop: false,
              slideshow: false,
              itemWidth: 64,
              itemMargin: 5,
              asNavFor: '#slider',
              start:function(slider){
                  $('#slider .flex-direction-nav').remove();
                  $("#imagem-slide li").gkzoom();
              }
          });

          $('#slider').flexslider({
              animation: 'fade',
              animationSpeed: 300,
              controlNav: false,
              animationLoop: false,
              slideshow: false,
              sync: "#carousel",
              start: function(slider){
                 if ($('ul.slides li').size() < 11) {
                       $('ul.flex-direction-nav').remove();
                 }
              }
          });
      });
      $(document).ready(function(){

        var add_dias_uteis = function(date, dias) {
                var copiedDate = new Date(date.getTime());
                var dias_corridos = 0;
                for(i = 0; i < dias; i) {
                    copiedDate.setDate(copiedDate.getDate()+1);
                    if (!(copiedDate.getDay() == 0 || copiedDate.getDay() == 6)) {
                        i++;
                    }
                    dias_corridos++
                }
                date.setDate(date.getDate() + dias_corridos);

                return date;
            };

        $('.cep').mask('99999-999');
        var PATH = 'http://'+window.location.host;

          $("#calcula_frete").on('submit', function(ev){
              if($("#calc_cep").val().length == 9){
                  ev.preventDefault();
                  var id = "#janela1";
                  $('#table-calcular').html("");
                  $("#agendamento_texto").html("");
                  $('#table-cal');
                  var alturaTela = $(document).height();
                  var larguraTela = $(window).width();

                            if(value.valor == 0) {

          dataLayer = [{"productsShelf":[],"productsDetail":[{"position":"1","name":"Placa-Mãe ASUS p/ Intel LGA 1151 ATX ROG STRIX Z270E GAMING,DDR4,Aura Sync, Áudio Gamer, Rede Intel, SLI/CFX, Wi-Fi, USB 3.1 Frontal,HDMI/DP","category":"Hardware/Placas-mãe/P/ Processador Intel/ASUS","brand":"Asus;","price":1095.9,"id":"84264","available":true}],"visitor":"","pageType":"product","breadcrumb":[{"url":"http://www.kabum.com.br/hardware","name":"Hardware"},{"url":"http://www.kabum.com.br/hardware/placas-mae","name":"Placas-mãe"},{"url":"http://www.kabum.com.br/hardware/placas-mae/p-processador-intel","name":"P/ Processador Intel"},{"url":"http://www.kabum.com.br/hardware/placas-mae/p-processador-intel/asus","name":"ASUS"}]}];

    ";
  • Is the text really all of that up there?

3 answers

3

Keep in mind that a regex always needs a delimiter marking where the match ends. Saying that the text will always have name .* and price .* is not enough to solve your problem; it only defines where the regex should start looking.

You should always state the desired result and also mention any inconsistencies that may appear in the input.

Just saying "remembering that depending on the given link the name and price will be different however, you will always have name .* and price .*" is not enough, but I have written a more general answer to your problem. Try:

("name":".*?")("price":\d*[\.|\,]*\d*)

In short, the first capture group, ("name":".*?"), captures any number of characters, including special ones, that are preceded by "name":" and end at the next ".

The second, ("price":\d*[\.|\,]*\d*), captures any number of digits (0-9) after "price":, optionally with . or , as the decimal separator.
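
A minimal sketch of how a pattern along these lines could be applied in PHP with preg_match. The sample string is shortened from the dataLayer line in the question. The variable names, the inner groups that capture only the values, and the lazy .*? placed between the two parts (to skip the fields that sit between name and price) are assumptions of this sketch, not part of the pattern above.

```php
<?php
// Shortened sample of the dataLayer line from the question (assumption: the
// real page keeps the same field order, with "name" before "price").
$texto = 'dataLayer = [{"productsDetail":[{"position":"1","name":"ASUS ROG STRIX Z270E GAMING",'
       . '"category":"Hardware","brand":"Asus;","price":1095.9,"id":"84264"}]}];';

// Group 1: the name, up to the next double quote.
// Group 2: the price, digits with an optional . or , decimal part.
// The lazy .*? between the groups skips "category", "brand", etc.
$pattern = '/"name":"([^"]*)".*?"price":(\d+(?:[.,]\d+)?)/s';

$name = null;
$price = null;
if (preg_match($pattern, $texto, $m)) {
    $name  = $m[1];
    $price = $m[2];
    echo $name, "\n";
    echo $price, "\n";
}
```

Capturing only the values inside the groups avoids the substr() arithmetic that is otherwise needed to strip the "name":" and "price": prefixes.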

2


Final code

<?php

function getKabum($urlCompleta) {

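    // Collect libxml parse warnings internally instead of printing them.
    // (Chaining with "and" would skip the second call, because the first
    // one returns the previous setting, which is normally false.)
    libxml_use_internal_errors(true);
    libxml_clear_errors();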
    $header = "X-Forwarded-For: {$_SERVER['REMOTE_ADDR']}";
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, "$urlCompleta");
    curl_setopt($ch, CURLOPT_REFERER, "http://www.kabum.com.br");
    curl_setopt($ch, CURLOPT_HTTPHEADER, array($header));
    curl_setopt($ch, CURLOPT_USERAGENT, $_SERVER['HTTP_USER_AGENT']);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $html = curl_exec($ch);
    $DOM = new DOMDocument();
    $DOM->loadHTML($html);
    $xpath = new DomXpath($DOM);


    $titulo = $xpath->query('//h1[@class="titulo_det"]')->item(0);
    $preco = $xpath->query('//span[@class="preco_desconto"]')->item(0);
    // item(0) returns null when the node is missing, so guard before reading nodeValue.
    $title = $titulo ? $titulo->nodeValue : '';
    $price = $preco ? $preco->nodeValue : '';
    if (empty($title)) {
        // Fallback: read name and price from the dataLayer JSON embedded in the page.
        $pattern = '/"productsDetail":\[\{"position":"1","name":"([^"]+)".*?"price":(\d+(?:[.,]\d+)?)/';
        if (preg_match($pattern, $DOM->textContent, $m)) {
            $title = $m[1];
            $price = $m[2];
        }
    }
    $retorno = array("titulo" => $title, "preco" => $price);
    return $retorno;
}

$produto [] = getKabum("http://www.kabum.com.br/cgi-local/site/produtos/descricao.cgi?codigo=84264");
$produto [] = getKabum("http://www.kabum.com.br/cgi-local/site/produtos/descricao.cgi?codigo=84404");
// here cURL built the page differently
$produto [] = getKabum("http://www.kabum.com.br/cgi-local/site/produtos/descricao.cgi?codigo=75332");
$produto [] = getKabum("http://www.kabum.com.br/cgi-local/site/produtos/descricao.cgi?codigo=63735");
$produto [] = getKabum("http://www.kabum.com.br/cgi-local/site/produtos/descricao.cgi?codigo=85198");
$produto [] = getKabum("http://www.kabum.com.br/cgi-local/site/produtos/descricao.cgi?codigo=41620");
$produto [] = getKabum("http://www.kabum.com.br/cgi-local/site/produtos/descricao.cgi?codigo=34217");
$produto [] = getKabum("http://www.kabum.com.br/cgi-local/site/produtos/descricao.cgi?codigo=77987");
$produto [] = getKabum("http://www.kabum.com.br/cgi-local/site/produtos/descricao.cgi?codigo=63327");

foreach ($produto as $value) {
    if ($value['titulo'] == '') {
        print_r($value);
    }
    print $value['titulo'];
    print "<h1>" . $value['preco'] . "</h1><hr>";
}

2

Judging by the text you want to analyze and the PHP tag, I imagine you are fetching the page with cURL.

Suggestion: HTML parser

Since this is HTML analysis, the ideal approach in these cases is to use a parser.
If you go that route, I suggest Simple Html Parser.

Suggestion: JSON parser

Looking at the part of the HTML that contains what you want, you can see that the data to extract sits inside a JSON structure.

dataLayer = [{"productsShelf":[],"productsDetail":[{"position":"1","name":"Placa-Mãe ASUS p/ Intel LGA 1151 ATX ROG STRIX Z270E GAMING,DDR4,Aura Sync, Áudio Gamer, Rede Intel, SLI/CFX, Wi-Fi, USB 3.1 Frontal,HDMI/DP","category":"Hardware/Placas-mãe/P/ Processador Intel/ASUS","brand":"Asus;","price":1095.9,"id":"84264","available":true}],"visitor":"","pageType":"product","breadcrumb":[{"url":"http://www.kabum.com.br/hardware","name":"Hardware"},{"url":"http://www.kabum.com.br/hardware/placas-mae","name":"Placas-mãe"},{"url":"http://www.kabum.com.br/hardware/placas-mae/p-processador-intel","name":"P/ Processador Intel"},{"url":"http://www.kabum.com.br/hardware/placas-mae/p-processador-intel/asus","name":"ASUS"}]}];

I suggest parsing that structure directly. To do so, capture it and call json_decode($json, true) so that the content becomes an array, which is easier to work with.
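
A minimal sketch of that idea, assuming the page body is already in a string. The sample $html is shortened from the question, and the variable names and the capture pattern (everything from "dataLayer = [" up to the first "];") are assumptions of this sketch. The ISO-8859-1 conversion is based on the Content-Type header shown in the question, since json_decode() expects UTF-8.

```php
<?php
// Sample standing in for the fetched page body (shortened from the question).
$html = 'var x = 1; dataLayer = [{"productsShelf":[],"productsDetail":[{"position":"1",'
      . '"name":"ASUS ROG STRIX Z270E GAMING","price":1095.9,"id":"84264","available":true}],'
      . '"pageType":"product"}]; var y = 2;';

$produto = null;

// Capture the array literal assigned to dataLayer, up to the first "];".
if (preg_match('/dataLayer\s*=\s*(\[.*?\]);/s', $html, $m)) {
    // The response header in the question declares ISO-8859-1, while
    // json_decode() expects UTF-8, so convert before decoding.
    $json = iconv('ISO-8859-1', 'UTF-8', $m[1]);
    $data = json_decode($json, true);

    // json_decode() returns null on invalid JSON, so check before indexing.
    if (is_array($data) && isset($data[0]['productsDetail'][0])) {
        $produto = $data[0]['productsDetail'][0];
    }
}

if ($produto !== null) {
    echo $produto['name'], "\n";
    echo $produto['price'], "\n";
}
```

Once decoded, the product fields are read by key, so the breadcrumb "name" entries can no longer be confused with the product name.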

Solution by REGEX

If you still want to do it with a regex, you can use:

("name":"[^"]+")|("price":(?:\d{1,3}\.?)+[.,]\d{1,2})

See it working on regex101.

It also returns other "name" entries because the search is not very specific: "name": is the only exact part of the pattern.
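
A small sketch of that effect, using a shortened sample of the dataLayer line. The variable names and the narrower second pattern, which anchors on the preceding "productsDetail" key, are assumptions added for illustration.

```php
<?php
// Shortened sample: one product entry plus one breadcrumb entry.
$texto = 'dataLayer = [{"productsDetail":[{"position":"1","name":"ASUS ROG STRIX Z270E GAMING",'
       . '"price":1095.9}],"breadcrumb":[{"url":"http://www.kabum.com.br/hardware","name":"Hardware"}]}];';

// The unanchored "name" pattern returns every "name" entry,
// including the breadcrumb one.
preg_match_all('/"name":"([^"]+)"/', $texto, $all);
$names = $all[1];

// Anchoring on the preceding "productsDetail" key keeps only the product name.
$productName = null;
if (preg_match('/"productsDetail":\[\{.*?"name":"([^"]+)"/s', $texto, $m)) {
    $productName = $m[1];
}

print_r($names);
echo $productName, "\n";
```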

  • Thanks. However, on some pages, I don't know why, the textContent comes out very different. Is there any way to get the source code of the page with cURL? Every time I print $html (e.g. $html = curl_exec($ch);) it renders the page in my browser. I'll post my code as an answer so you can see it.

  • Open view-source:http://www.kabum.com.br/cgi-local/site/produtos/descricao.cgi?codigo=75332 so that it is always the same.

  • @Herick it will indeed render the HTML in your page, because the content of your $html variable is HTML and your browser tries to interpret it. If you want to display the HTML source, use echo htmlspecialchars($html)

  • Do you know why this works on my localhost machine but not when I deploy it to the final uolhost server? (Note: on the final server the Kabum site works and Submarino gives this error.) It shows the following message: Access Denied You don't have permission to access "http://..."
