Get value from external HTML TAG "<link>"

10

I need to get the attribute values of the <link> tag (or tags, if there is more than one) from the HTML of another site.

What I tried:

$url = 'http://localhost/teste/';
$content = trim(file_get_contents($url));
preg_match("/<link(.*?)>/i",$content,$return); 
var_dump($return);

Output:

array (size=2)
  0 => string '<link rel="shortcut icon" href="http://localhost/teste/icon.png">' (length=77)
  1 => string ' rel="shortcut icon" href="http://localhost/teste/icon.png"' (length=71)

I don’t know if I made myself clear, but I’d like it to return the following:

array (size=1)
  0 => 
    array (size=2)
      'rel' => string 'shortcut icon' (length=13)
      'href' => string 'http://localhost/teste/icon.png' (length=31)

4 answers

8

Try fetching data from the HTML by navigating the DOM rather than using regular expressions. Hypothetically, there could be a link nested inside another link, and in that case your expression would fail.

There is a relatively old but well-known post about why regular expressions should not be used to parse HTML. Basically, HTML is not a regular language, so by definition it cannot be parsed by a regular expression.

http://www.codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html

This applies, of course, only when you are able to traverse the HTML DOM (which, since you are using PHP, you are).

My solution is as follows:

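<?php
$html = trim(file_get_contents('http://localhost/teste/'));
$dom = new DOMDocument;
// Real-world HTML is rarely well-formed XML, so use loadHTML() and
// collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);
$dom->loadHTML($html);
$links = $dom->getElementsByTagName('link');
foreach ($links as $link) {
    // DOMElement has no getAttributes() method; iterate its attributes property.
    foreach ($link->attributes as $attribute) {
        echo $attribute->name, ' => ', $attribute->value, PHP_EOL;
    }
}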
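To get exactly the array shape requested in the question, the DOM approach can be wrapped in a small function. This is only a minimal sketch: the function name extractLinkAttributes() and the inline sample page are assumptions made for illustration, so that the code runs without a local server.

```php
<?php
// Hypothetical helper: returns one name => value array per <link> element.
function extractLinkAttributes(string $html): array
{
    $dom = new DOMDocument;
    // Collect parser warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $result = [];
    foreach ($dom->getElementsByTagName('link') as $link) {
        $attributes = [];
        // Each attribute is a DOMAttr exposing name and value.
        foreach ($link->attributes as $attribute) {
            $attributes[$attribute->name] = $attribute->value;
        }
        $result[] = $attributes;
    }
    return $result;
}

// Hypothetical sample page standing in for the remote site.
$html = '<html><head><link rel="shortcut icon" href="http://localhost/teste/icon.png"></head><body></body></html>';
var_dump(extractLinkAttributes($html));
```

For a real site, the sample string would be replaced with the result of file_get_contents($url).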

6


Although HTML should really be handled with the DOM, a more comprehensive, and consequently more appropriate, regular expression would be:

/<link.*?href="(.*?)".*?>/i

Since:

  • Since the question is tagged PHP and demonstrates its attempt with preg_match(), note that the g modifier does not exist among the PCRE modifiers PHP supports.

  • According to the HTML and XHTML specifications, the <link> tag has no content, only attributes. The two specifications differ mainly in how the tag is closed.

  • The desired href attribute will not always appear in the same position, not even when you wrote the HTML yourself. That is why the expression allows for anything before and after the attribute.

As for usage: to capture all the values, just use preg_match_all().
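As an illustration, a minimal sketch of that call follows. The two-tag markup is a made-up sample, and the variable names are arbitrary.

```php
<?php
// Hypothetical markup with two <link> tags, used only for illustration.
$html = '<link rel="shortcut icon" href="http://localhost/teste/icon.png">'
      . '<link href="http://localhost/teste/style.css" rel="stylesheet">';

// preg_match_all() finds every match; with the default flags,
// $matches[1] holds the first capture group of each match.
preg_match_all('/<link.*?href="(.*?)".*?>/i', $html, $matches);

$hrefs = $matches[1];
print_r($hrefs);
```

Note that the href is found in both tags even though it appears in a different position in each one.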

[EDIT]

As @Sergio pointed out, after the question was edited the solution above no longer applies. The explanation is still valuable, however, and for that reason alone it stays.

I will nevertheless remove whatever is superfluous. That content should remain available in the revision history of this answer (assuming that is a site-wide feature).

Please read carefully and see how much more complicated everything gets when you try to drive a screw with a hammer:

  1. First we change the Regular Expression to find all attributes.

  2. PHP does not capture "groups of groups" automatically. That is, you cannot define one thing to be captured and have it capture every occurrence of that pattern, so you need to separate each key=value pair yourself.

    PHP offers many ways to do this, and one viable alternative would be to replace the spaces between the key=value pairs and use parse_str(). But doing that would itself require a regular expression, since a simple str_replace() would mangle values that contain spaces, such as rel. So we do everything with regular expressions.

  3. We have to iterate over the array produced by preg_match_all(); that is unavoidable. But since I will be applying the same routine to each element of the array, mapping its data into something else, I prefer to use array_map().

  4. preg_split() does its job, but although it returns an array, that array is not in the format you need, with the attribute names as keys. We can work around this with array_chunk().

  5. But array_chunk() produces N arrays inside the one we already had, which in turn is inside another. OMFG! I don’t want to iterate over all of this! In this case, a great trick is to transpose the matrix, and for that there is what is probably the most practical highly voted answer I have ever seen on the English Stack Overflow.

When you transpose this matrix, it looks like this:

array (size=2)
  0 => 
    array (size=2)
      0 => string 'rel' (length=3)
      1 => string 'href' (length=4)
  1 => 
    array (size=2)
      0 => string 'shortcut icon' (length=13)
      1 => string 'http://localhost/teste/icon1.png' (length=32)

This is a structure that array_combine() can easily handle.

The complete code can be copied and seen running at that link.
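As a rough reconstruction of the five steps above (not the author's original code), the pipeline could look like the sketch below. The patterns, the sample markup, and the variable names are assumptions, and it only handles double-quoted, non-empty attribute values. For the transposition it uses array_column() rather than the trick the answer refers to, because array_column() also behaves correctly when a tag has a single attribute.

```php
<?php
// Hypothetical input: two <link> tags with double-quoted, non-empty values.
$html = <<<HTML
<link rel="shortcut icon" href="http://localhost/teste/icon1.png">
<link rel="stylesheet" href="http://localhost/teste/style.css" />
HTML;

// Step 1: capture the attribute portion of every <link> tag.
preg_match_all('/<link\s+(.*?)\s*\/?>/is', $html, $matches);

// Step 3: apply the same routine to each captured attribute string.
$links = array_map(function (string $attributeString): array {
    // Step 2: split on '="' and on the closing '"', which yields a flat
    // list of the form [name, value, name, value, ...].
    $parts = preg_split('/="|"\s*/', $attributeString, -1, PREG_SPLIT_NO_EMPTY);

    // Step 4: group the flat list into [name, value] pairs.
    $pairs = array_chunk($parts, 2);

    // Step 5: transpose the pairs into one list of names and one of values.
    $names  = array_column($pairs, 0);
    $values = array_column($pairs, 1);

    // Combine names and values into a name => value array.
    return array_combine($names, $values);
}, $matches[1]);

var_dump($links);
```

Attributes without a value, single-quoted values, or empty values would break the pairing in step 2, which is one more reason to prefer the DOM.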

4

3

I recommend using the PHP Simple HTML DOM Parser. It is great and very easy to use, and I use it in various scripts to analyze HTML from other sites.

Bruno Augusto's answer is very good. I just want to complement it with a few details that I think are important to keep in mind. When I need to analyze HTML content with a regular expression, I try to write a more thorough pattern, because HTML is very irregular: attributes have no fixed order, and the code may contain line breaks. In your case I would use this more "complete" regular expression:

/<link.*?href=\"([^\"]*?)\".*?\/?>/si

Basically, the improvements are three replacements:

1 - (.*?) becomes ([^\"]*?), which is more correct: the value cannot contain the " character when the attribute delimiter is also ". The same applies if the delimiter were the ' character.

2 - > becomes \/?>, because there may or may not be a / character before the > character.

3 - /i becomes /si, because there may be line breaks between attributes, values, and so on. The HTML tags on a site are not always written on a single line; one part may be on one line and another part on the next.

If you use the original regular expression suggested by Bruno Augusto, it may fail to find some <link> tags, for instance when the tag is broken across several lines. Example:

$string = <<<EOF
<link
rel="shortcut icon"
href="http://localhost/teste/icon.png"
>
EOF;

if ( preg_match_all( '/<link.*?href="(.*?)".*?>/i', $string, $matches, PREG_SET_ORDER ) ) {
    var_dump( $matches );
    die();
} else {
    echo 'No tags found.';
    /* This branch runs because no tags are found: the tag spans several lines and, without the s modifier, "." does not match line breaks */
}

Now using the same example code with the most complete regular expression suggested by me, the results will be obtained successfully:

$string = <<<EOF
<link
rel="shortcut icon"
href="http://localhost/teste/icon.png"
>
EOF;

if ( preg_match_all( '/<link.*?href=\"([^\"]*?)\".*?\/?>/si', $string, $matches, PREG_SET_ORDER ) ) {
    /* Tags found successfully */
    var_dump( $matches );
    die();
}
