Regular expressions

3

I am having some difficulty putting together regular expressions. I’m trying to work with this code:

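<?php
// Download the page that lists the daily prices.
$url = file_get_contents('http://ciagri.iea.sp.gov.br/precosdiarios/');
// $expressao is the pattern I still need to write.
preg_match_all($expressao, $url, $conteudo);
print_r($conteudo); // $conteudo is an array, so echo would only print "Array"
?>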

I need to extract the prices from this markup:

<tr style="background-color:White;">
    <td style="width:170px;">
        Mandioca para mesa
    </td>
    <td style="width:120px;">
        Mogi Mirim
    </td>
    <td align="right" style="width:70px;">
        11,50
    </td>
    <td align="center" style="width:70px;">
        cx.23 kg
    </td>
    <td style="width:200px;">
        <div id="ctl00_ContentPlaceHolder1_gridRecebidos_ctl95_PanelGridObs">
        </div>
    </td>
</tr>
<tr>
    <td style="width:170px;">
        Mandioca para mesa
    </td>
    <td style="width:120px;">
        Pindamonhangaba
    </td>
    <td align="right" style="width:70px;">
        28,00
    </td>
    <td align="center" style="width:70px;">
        cx.23 kg
    </td>
    <td style="width:200px;">
        <div id="ctl00_ContentPlaceHolder1_gridRecebidos_ctl96_PanelGridObs">
        </div>
    </td>
</tr>
<tr style="background-color:White;">
    <td style="width:170px;">
        Mandioca para mesa
    </td>
    <td style="width:120px;">
        Sorocaba
    </td>
    <td align="right" style="width:70px;">
        8,79
    </td>
    <td align="center" style="width:70px;">
        cx.23 kg
    </td>
    <td style="width:200px;">
        <div id="ctl00_ContentPlaceHolder1_gridRecebidos_ctl97_PanelGridObs">
        </div>
    </td>
</tr>

What would be the best pattern to use to get the price for each city?

  • Are you trying to fetch content from another web page?

  • Tip: Do not use regex to parse HTML; take a look at XPath, YQL, and htmlSQL.

  • Yes, I want to fetch the price quote of a product that is updated daily. I will take a look at XPath and YQL.

  • For PHP there is htmlSQL (https://github.com/hxseven/htmlSQL)

2 answers

5


Ideally, use XPath to get these prices. For the page you mentioned, it would look like this:

$dom = new DOMDocument;
$dom->loadHTMLFile("http://ciagri.iea.sp.gov.br/precosdiarios/");

$xpath = new DOMXPath($dom);
// this query selects every TD in position 3 of the first table with the class "tabela_dados"
$nodes = $xpath->query("(//table[@class='tabela_dados'])[1]/tr/td[position()=3]");

foreach ($nodes as $i => $node) {
    echo $node->nodeValue . "\n"; // prints every price
}

  • Thank you, I’ll try it when I get home.

  • There is also the DOM API (which I find simpler).

  • Warning: DOMDocument::loadHTMLFile(): htmlParseEntityRef: expecting ';' in http://ciagri.iea.sp.gov.br/precosdiarios/, line: 3654. Do you know the reason for this error?

  • Add libxml_use_internal_errors(true); at the beginning of the code. Even with the warning, though, it should work normally.

  • Is there any way I can access a certain position of the $nodes variable, as if it were an array?

  • @Rodolfooliveira if this answer solves your original question, you can mark it as accepted. See more on [tour].

  • @Rodolfooliveira You can access it like this: $nodes->item(3); // returns the item at index 3 (counting from zero). Don’t forget to mark the answer that solved your problem as accepted :)

  • @Andréribeiro I have already marked the answer as accepted. As for what I asked above, I still don’t understand: would it be $nodes->nodeValue(3);, or what is item supposed to be? Since I know the position of the value I want, I wouldn’t need the foreach.

  • @Rodolfooliveira It would be $nodes->item(3) to get the item at position 3; item is a method. To read its text, use $nodes->item(3)->nodeValue.
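
Putting the answer and the comments above together, a minimal sketch might look like the following. It parses an inline sample instead of fetching the live page. The wrapping <table class="tabela_dados"> is an assumption taken from the answer's query, since the question only shows the rows. It also uses //tr rather than /tr, so the query still matches if a parser adds a <tbody>.

```php
<?php
// Sample rows modeled on the markup in the question.
// ASSUMPTION: the rows live in a table with class "tabela_dados".
$html = <<<HTML
<table class="tabela_dados">
<tr style="background-color:White;">
    <td style="width:170px;">Mandioca para mesa</td>
    <td style="width:120px;">Mogi Mirim</td>
    <td align="right" style="width:70px;">
        11,50
    </td>
    <td align="center" style="width:70px;">cx.23 kg</td>
</tr>
<tr>
    <td style="width:170px;">Mandioca para mesa</td>
    <td style="width:120px;">Pindamonhangaba</td>
    <td align="right" style="width:70px;">
        28,00
    </td>
    <td align="center" style="width:70px;">cx.23 kg</td>
</tr>
<tr style="background-color:White;">
    <td style="width:170px;">Mandioca para mesa</td>
    <td style="width:120px;">Sorocaba</td>
    <td align="right" style="width:70px;">
        8,79
    </td>
    <td align="center" style="width:70px;">cx.23 kg</td>
</tr>
</table>
HTML;

// Collect libxml parser warnings internally instead of printing them.
libxml_use_internal_errors(true);

$dom = new DOMDocument;
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
// Third <td> of every row in the first table with class "tabela_dados".
$nodes = $xpath->query("(//table[@class='tabela_dados'])[1]//tr/td[3]");

// DOMNodeList::item() is zero-based, so item(1) is the second row.
$second = trim($nodes->item(1)->nodeValue);
echo $second, "\n"; // 28,00

// Collect every price; trim() drops the whitespace around the cell text.
$prices = [];
foreach ($nodes as $node) {
    $prices[] = trim($node->nodeValue);
}
print_r($prices);
```

Because item() counts from zero, the price in the n-th row is at $nodes->item(n - 1), and no foreach is needed when the position is already known.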

4

I managed to do it with this regex:

<tr[^>]*>\s*<td[^>]*>[^<]*<\/td>\s*<td[^>]*>[^<]*<\/td>\s*<td[^>]*>\s*(\S*)

It is important that you capture all the resulting matches.

How does this expression work?

Let’s break it into pieces:

  1. <tr[^>]*> - Starts with <tr, then [^>]* skips everything up to the next >, and the final > consumes it. In other words, it consumes the whole <tr blablabla>. It also works if there is only <tr>.
  2. \s* - Consumes any run of blank spaces and line breaks.
  3. <td[^>]*>[^<]*<\/td>\s* - Starts with <td, then [^>]* skips everything up to the >, and the > is consumed. [^<]* keeps consuming until the next <, then <\/td> consumes the </td>, and \s* consumes the blanks and line breaks that follow. That is, it consumes the first <td blabla>blablabla</td>.
  4. Same as item 3; it consumes the second <td blabla>blablabla</td>.
  5. <td[^>]*>\s* - Consumes the <td blabla> that follows, plus the blanks and line breaks. Right after that comes the price.
  6. (\S*) - Captures all the characters that follow, up to the next whitespace character (without consuming it). That is, it captures the price.

Tested here. To check, put the regex in the first field and g in the second. In the area below, paste the text you want to search (the HTML, in this case).
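
As a rough sketch of how this could be plugged into PHP, the same expression can be passed to preg_match_all, which collects every match. Here it is written with the x flag so each of the six pieces above can carry a comment; the ~ delimiter, the variable names, and the shortened sample rows are choices made for this illustration, not part of the original answer.

```php
<?php
// Two sample rows modeled on the markup in the question.
$html = <<<HTML
<tr style="background-color:White;">
    <td style="width:170px;">
        Mandioca para mesa
    </td>
    <td style="width:120px;">
        Mogi Mirim
    </td>
    <td align="right" style="width:70px;">
        11,50
    </td>
</tr>
<tr>
    <td style="width:170px;">
        Mandioca para mesa
    </td>
    <td style="width:120px;">
        Pindamonhangaba
    </td>
    <td align="right" style="width:70px;">
        28,00
    </td>
</tr>
HTML;

// Same expression as above. With the x flag, whitespace in the pattern is
// ignored and everything after # on a line is a comment.
$pattern = '~
    <tr[^>]*>                 # 1. opening tr tag, with or without attributes
    \s*                       # 2. blanks and line breaks
    <td[^>]*>[^<]*</td>\s*    # 3. first cell, the product
    <td[^>]*>[^<]*</td>\s*    # 4. second cell, the city
    <td[^>]*>\s*              # 5. opening tag of the third cell
    (\S*)                     # 6. capture the price
~x';

preg_match_all($pattern, $html, $matches);

// $matches[1] holds what capture group 1 matched in each row.
print_r($matches[1]);

// The prices use a decimal comma; swap it for a dot before casting.
$prices = array_map(
    static function (string $price): float {
        return (float) str_replace(',', '.', $price);
    },
    $matches[1]
);
print_r($prices);
```

Note that (\S*) depends on whitespace following the price, as it does in the question's markup. If a cell were written as <td>11,50</td> on one line, the capture would also swallow the closing tag; ([^\s<]*) would be a more defensive capture in that case.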

  • I haven’t tested it yet because I’m not at home.

  • @Rodolfooliveira I’ve edited the answer. :)
