Getting the text inside a <ul> list with PHP

Hello, I am working on a project and need to extract some text from a list on another site. Here is what I need:

I'm using file_get_contents:

$url = "http://www.site.com";
$html = file_get_contents($url);
$getTextoList = "";
preg_match_all($getTextoList, $html, $GetTexto);
$GetTexto = str_replace(" ", ", ", $GetTexto[1][0]);

The site has a list like this:

<div class="listdoTexto">
<ul>                                                                    
<li class="texto"><a href="textolink">texto</a></li>
<li class="texto"><a href="textolink">texto</a></li>
<li class="texto"><a href="textolink">texto</a></li>
<li class="texto"><a href="textolink">texto</a></li>
</ul>
</div>

I want to get just the text that is between <a> and </a> in that list.

  • I think this might help you: http://php.net/manual/en/book.dom.php

  • @thiagoalessio Do you have an example?

1 answer

You can use a regex with preg_match_all:

|<div class="listdoTexto">[\w\W]*<\/ul>|

Here [\w\W]* matches any characters between <div class="listdoTexto"> and <\/ul>:

\w -> any alphanumeric character or underscore "_"
\W -> any character that is NOT alphanumeric or underscore "_"
*  -> zero or more occurrences; it is greedy, so the match runs from <div class="listdoTexto"> to the last </ul>
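As a small sketch of the greedy behaviour described above, the snippet below compares [\w\W]* with its lazy form [\w\W]*?, which stops at the first </ul> instead of the last. The sample markup is an assumption modelled on the list in the question, not the real page.

```php
<?php
// Sample markup with two lists (an assumption mirroring the question's HTML).
$html = <<<'HTML'
<div class="listdoTexto">
<ul>
<li class="texto"><a href="textolink">texto1</a></li>
<li class="texto"><a href="textolink">texto2</a></li>
</ul>
</div>
<div class="listdoTexto2">
<ul>
<li class="texto"><a href="textolink">texto5</a></li>
</ul>
</div>
HTML;

// Greedy: [\w\W]* runs to the LAST </ul> in the document.
preg_match('|<div class="listdoTexto">[\w\W]*</ul>|', $html, $greedy);

// Lazy: [\w\W]*? stops at the FIRST </ul> after the opening div.
preg_match('|<div class="listdoTexto">[\w\W]*?</ul>|', $html, $lazy);

// Does each match spill over into the second list?
$greedySpills = strpos($greedy[0], 'texto5') !== false;
$lazySpills   = strpos($lazy[0], 'texto5') !== false;

var_dump($greedySpills); // bool(true)
var_dump($lazySpills);   // bool(false)
```

With the lazy quantifier the later trimming at the first </ul> may not be needed, although the rest of this answer keeps the greedy pattern.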

This fills an array whose index [0] holds the matches, which contain the output you want. I then convert that array to a string with implode:

$GetTexto = implode(",", $GetTexto[0]);

However, there is a problem if the HTML contains more than one list. Example:

             |→ <div class="listdoTexto">
             |    <ul>
I only want  |       <li class="texto"><a href="textolink">texto1</a></li>
to get       |       <li class="texto"><a href="textolink">texto2</a></li>
this         |       <li class="texto"><a href="textolink">texto3</a></li>
part...      |       <li class="texto"><a href="textolink">texto4</a></li>
             |→   </ul>
                </div>

                <div class="listdoTexto2">
                  <ul>
...but the           <li class="texto"><a href="textolink">texto5</a></li>
regex goes           <li class="texto"><a href="textolink">texto6</a></li>
up to here →      </ul>
                </div>

That is, the result of $GetTexto after the implode would be this:

<div class="listdoTexto">
   <ul>
      <li class="texto"><a href="textolink">texto1</a></li>
      <li class="texto"><a href="textolink">texto2</a></li>
      <li class="texto"><a href="textolink">texto3</a></li>
      <li class="texto"><a href="textolink">texto4</a></li>
  </ul>
</div>

<div class="listdoTexto2">
   <ul>
      <li class="texto"><a href="textolink">texto5</a></li>
      <li class="texto"><a href="textolink">texto6</a></li>
  </ul>

Since I only want everything up to the first </ul>, I can use substr with strpos:

$GetTexto = substr($GetTexto, 0, strpos($GetTexto, "</ul>"));

The result now is this:

<div class="listdoTexto">
   <ul>
      <li class="texto"><a href="textolink">texto1</a></li>
      <li class="texto"><a href="textolink">texto2</a></li>
      <li class="texto"><a href="textolink">texto3</a></li>
      <li class="texto"><a href="textolink">texto4</a></li>
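One caveat with the substr/strpos step: strpos returns false when "</ul>" is not found, and the cut then yields an empty string. A minimal sketch of a guard follows; textBefore is a hypothetical helper name, not part of the original code.

```php
<?php
// Hypothetical helper (not from the original answer): returns everything
// before the first occurrence of $marker, or the whole string if the
// marker does not occur at all.
function textBefore(string $haystack, string $marker): string
{
    $pos = strpos($haystack, $marker);

    // strpos returns false when the marker is missing, so compare with ===
    // to tell "not found" apart from "found at position 0".
    return $pos === false ? $haystack : substr($haystack, 0, $pos);
}

echo textBefore("<ul><li>texto1</li></ul><ul><li>texto5</li></ul>", "</ul>");
```

Used in place of the bare substr call, this leaves the string untouched when the page has no closing </ul>.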

Since I only want the text, I use strip_tags to delete the tags:

$GetTexto = strip_tags($GetTexto);

It returns only the text, but with line breaks and possibly spaces before, after, or between the texts:

texto1
texto2
texto3
texto4

I can use preg_replace with trim to replace the line breaks and unwanted spaces with ,, which will be replaced again in the next step:

$GetTexto = preg_replace("/\s{2,}|\n/", ",,", trim($GetTexto));

Now we have:

texto1,,texto2,,texto3,,texto4

Now, to separate the texts with a comma and a space, use str_replace to replace the ,,:

$GetTexto = str_replace(",,", ", ", $GetTexto);

Final result:

texto1, texto2, texto3, texto4
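The last two steps could also be done by splitting on line breaks instead of going through the ,, placeholder. This is only a sketch of an alternative; the $raw string is an assumed example of what strip_tags might return for the list above.

```php
<?php
// Assumed example of the text left after strip_tags(): one item per line,
// with indentation and blank lines around it.
$raw = "\n   \n      texto1\n      texto2\n      texto3\n      texto4\n";

// Split on any line break, trim each piece, and drop the empty ones.
$items = array_filter(array_map('trim', preg_split('/\R/', $raw)), 'strlen');

$result = implode(', ', $items);

echo $result; // texto1, texto2, texto3, texto4
```

Unlike the \s{2,} replacement, this version does not break an item apart if its text happens to contain two consecutive spaces.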

Although this reaches the desired result, I don't know whether it is the best approach. There may be a more efficient method using the Document Object Model, but I hope this helps.
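For comparison, here is a rough sketch of what a Document Object Model version might look like, using PHP's DOMDocument and DOMXPath. I have not run it against the real site; the sample markup is an assumption that stands in for the fetched page.

```php
<?php
// Sample markup standing in for the fetched page (an assumption).
$html = <<<'HTML'
<div class="listdoTexto">
<ul>
<li class="texto"><a href="textolink">texto1</a></li>
<li class="texto"><a href="textolink">texto2</a></li>
</ul>
</div>
<div class="listdoTexto2">
<ul>
<li class="texto"><a href="textolink">texto5</a></li>
</ul>
</div>
HTML;

// Real-world HTML is rarely valid, so collect parser warnings
// internally instead of printing them.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// Select every <a> inside an <li> of the div whose class attribute
// is exactly "listdoTexto"; the second list does not match.
$xpath = new DOMXPath($doc);
$links = $xpath->query('//div[@class="listdoTexto"]//li/a');

$texts = [];
foreach ($links as $link) {
    $texts[] = trim($link->textContent);
}

echo implode(', ', $texts); // texto1, texto2
```

Because the query targets the div by its class, a second list elsewhere in the page does not affect the result, and no trimming at </ul> is needed.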

Code:

$url = "http://www.site.com";
$html = file_get_contents($url);

// Match from the opening div up to the last </ul> in the page
$getTextoList = '|<div class="listdoTexto">[\w\W]*<\/ul>|';
preg_match_all($getTextoList, $html, $GetTexto);

$GetTexto = implode(",", $GetTexto[0]);
$GetTexto = substr($GetTexto, 0, strpos($GetTexto, "</ul>")); // keep everything before the first </ul>
$GetTexto = strip_tags($GetTexto); // remove the HTML tags
$GetTexto = preg_replace("/\s{2,}|\n/", ",,", trim($GetTexto)); // line breaks and runs of spaces become ",,"
$GetTexto = str_replace(",,", ", ", $GetTexto); // ",," becomes ", "

Testing at Ideone

  • Can you help me? I was getting an SSL error, which I fixed by adding $arrContextOptions=array( "ssl"=>array( "verify_peer"=>false, "verify_peer_name"=>false, ), ); but when I put in the URL I receive nothing in the input where this value should be generated.

  • $html = file_get_contents($url, false, stream_context_create($arrContextOptions));

  • I did some testing and the HTML is arriving, but there's something wrong here:

  • $GetTexto = implode(",", $GetTexto[0]); $GetTexto = substr($GetTexto, 0, strpos($GetTexto, "</ul>")); $GetTexto = strip_tags($GetTexto); $GetTexto = preg_replace("/\s{2,}|\n/", ",,", trim($GetTexto)); $GetTexto = str_replace(",,", ", ", $GetTexto);

  • Does it also work if there are other divs before the list div?

  • Oops! Man, it does work. What went wrong?

  • Then I saw that the HTML contains both " " and ' ' quotes; I think that part is where it goes wrong.

  • I will analyze it here...

  • Look here, I need to get these tags from the https://ideone.com/jyf99svideo

  • On Ideone the URL comes out wrong, see: https://w...content-available-to-Author-only...s. com....
