Grab only some attributes of HTML tags

Question

Asked 5 years, 2 months ago

I have several inputs, obtained via curl, but I only need the name and value of these inputs:

<input type="hidden" autocomplete="off" name="timezone" value="180" id="u_0_9">

I have this code:

preg_match_all('/name="(.*?)" value="(.*?)"/is', $response, $names);

However, this code returns wrong matches, and sometimes it misses what I need. I just want name and value.

Keep in mind that not all inputs look like this; some look like this:

<input type="hidden" name="jazoest" value="2703" autocomplete="off">

This one has no id or other attributes, but all I want is name and value.

Any tips?

What do you mean by "wrong records"? Do you have an example? It seems to me that the regex matches both of the examples you gave: https://regex101.com/r/c0IvzV/1

– Gabriel Santos

2019/04/06 at 03:32
And the test code also matches them: https://gist.github.com/ogabrielsantos/f020c1728d8fd9b06aefa8797e5d8ab2

– Gabriel Santos

2019/04/06 at 03:35
@Gabrielsantos, yes, it was picking up other things besides the inputs. I solved it using array_combine.

– Banks

2019/04/06 at 07:47

1 answer

by hkotsubo • **55,826** points · Answer 1 · 2019-04-06T17:44:13+00:00

Since you mentioned curl, I am assuming that you have a string containing all the HTML. In this case, an alternative is to use the DOM extension:

$html = <<<HTML
<html>
<body>
<input type="hidden" autocomplete="off" name="timezone" value="180" id="u_0_9">
<input type="hidden" name="jazoest"
 autocomplete="off" value="2703">
<input type="radio" value="value" name="test" >
</body>
</html>
HTML;

// loadHTML is an instance method; calling it statically fails on PHP 8
$doc = new DOMDocument();
$doc->loadHTML($html);
// get all the input elements
$inputs = $doc->getElementsByTagName('input');
foreach ($inputs as $input) {
    // get the name and value
    echo $input->getAttribute('name'). '='. $input->getAttribute('value'), PHP_EOL;
}

The output is:

timezone=180
jazoest=2703
test=value

If you want, you can restrict the type of input by doing something like:

if ($input->getAttribute('type') == 'hidden') {
    // only take the values if the field is type=hidden
}

That way you only get the hidden fields, for example. Another alternative to restrict the search is to use DOMXPath:

// loadHTML is an instance method; calling it statically fails on PHP 8
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
// get only the inputs with type="hidden"
$entries = $xpath->query('//input[@type = "hidden"]');
foreach ($entries as $entry) {
    echo $entry->getAttribute('name'). '='. $entry->getAttribute('value'), PHP_EOL;
}

In this case, it only looks for inputs that have type="hidden". The output is:

timezone=180
jazoest=2703
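
If you need the pairs as data rather than printed text, a minimal sketch along the same lines could collect them into an associative array. The `$fields` variable, the extra `[@name]` filter and the `libxml_use_internal_errors` call (useful because HTML fetched with curl is rarely perfectly valid) are my own additions for illustration, not part of the code above:

```php
<?php
// Hypothetical sample; in practice $html would be the response returned by curl.
$html = <<<HTML
<html>
<body>
<input type="hidden" autocomplete="off" name="timezone" value="180" id="u_0_9">
<input type="hidden" name="jazoest"
 autocomplete="off" value="2703">
<input type="radio" value="value" name="test" >
</body>
</html>
HTML;

// Collect libxml warnings instead of printing them.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$fields = [];
// Only hidden inputs that actually have a name attribute.
foreach ($xpath->query('//input[@type = "hidden"][@name]') as $input) {
    $fields[$input->getAttribute('name')] = $input->getAttribute('value');
}

print_r($fields);
```

With the sample above, `$fields` should end up holding timezone and jazoest as keys, with their values as strings.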

Using regex

I find the solution above simpler than using regex. Not that using regex is "wrong"; I just think that in this case an HTML parser (DOMDocument or any other) is more appropriate, since it handles several situations that are much harder to deal with using regex.

Take the regex you used: name="(.*?)" value="(.*?)". It only captures name and value if they appear exactly in this order, separated by a single space.

In the test above, notice that I purposely placed the autocomplete attribute between name and value in the second input (and even put a line break in the middle of the tag, which HTML allows), and in the third input I reversed the order and put value before name. As a result, this regex would only capture the first input correctly.
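
To see this concretely, the sketch below runs the regex from the question against a trimmed copy of the three test tags (the surrounding html and body tags are left out for brevity). Because of the s option, the second tag can still produce a match, but its "name" swallows part of the tag; the third tag is not matched at all. The use of PREG_SET_ORDER and var_export is just my choice to make the result easier to read:

```php
<?php
$html = <<<HTML
<input type="hidden" autocomplete="off" name="timezone" value="180" id="u_0_9">
<input type="hidden" name="jazoest"
 autocomplete="off" value="2703">
<input type="radio" value="value" name="test" >
HTML;

// Same pattern as in the question; PREG_SET_ORDER groups results per match.
preg_match_all('/name="(.*?)" value="(.*?)"/is', $html, $names, PREG_SET_ORDER);

foreach ($names as $m) {
    // var_export makes embedded quotes and line breaks visible
    echo var_export($m[1], true), ' => ', var_export($m[2], true), PHP_EOL;
}
```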

Of course, this can be handled with regex, but there are too many details to worry about if you want an expression that covers every possible case. For example, reading the 3 tags from the previous example and extracting the name and value would look something like this:

$html = <<<HTML
<html>
<body>
<input type="hidden" autocomplete="off" name="timezone" value="180" id="u_0_9">
<input type="hidden" name="jazoest"
 autocomplete="off" value="2703">
<input type="radio" value="value" name="test" >
</body>
</html>
HTML;
$pos = 0;
// search for an input tag
while (preg_match('/<input\b([^>]*)>/i', $html, $match, PREG_OFFSET_CAPTURE, $pos)) {
    $conteudo = $match[1][0]; // content of the current input tag
    $pos_conteudo = 0;
    $type = '';
    $name = '';
    $value = '';
    // look for the attributes of the current input tag
    while ($pos_conteudo < strlen($conteudo) &&
           preg_match('/\b(type|name|value)\s*=\s*(?|"([^"]*)"|\'([^\']*)\'|([^\'">\s]*))/i',
                      $conteudo, $match_conteudo, PREG_OFFSET_CAPTURE, $pos_conteudo)) {
        $attr_name = strtolower($match_conteudo[1][0]);
        if ($attr_name == 'type') {
            $type = $match_conteudo[2][0];
        } else if ($attr_name == 'name') {
            $name = $match_conteudo[2][0];
        } else if ($attr_name == 'value') {
            $value = $match_conteudo[2][0];
        }
        // continue searching right after the end of the previous match,
        // so that an attribute value is never re-scanned as if it were an attribute
        $pos_conteudo = $match_conteudo[0][1] + strlen($match_conteudo[0][0]);
    }
    // the next preg_match call starts after the content already found
    $pos = strlen($conteudo) + $match[1][1];
    echo $type. '='. $name. '='.$value, PHP_EOL;
}

The output is:

hidden=timezone=180
hidden=jazoest=2703
radio=test=value

Again, you can use if ($type == 'hidden') or something similar if you want to restrict the search to a specific type of input (or change the regex above itself, if you prefer).

Basically, this solution takes advantage of the fact that preg_match accepts, as a parameter, the position at which the search should begin. And with the PREG_OFFSET_CAPTURE option, the position at which each match was found is also returned in the matches array (so I know where the tag was found and can continue the search from there).
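
A stripped-down sketch of that mechanism, on a made-up string rather than HTML, may make the shape of the returned array clearer. The `$text`, `$found` and `$pos` names and the id=... pattern are purely illustrative:

```php
<?php
// Hypothetical input, just to show the mechanics.
$text = 'id=1; id=22; id=333';
$pos = 0;
$found = [];
while (preg_match('/id=(\d+)/', $text, $m, PREG_OFFSET_CAPTURE, $pos)) {
    // With PREG_OFFSET_CAPTURE each entry is [matched text, byte offset]
    [$value, $offset] = $m[1];
    $found[] = [$value, $offset];
    // Resume right after the end of the whole match
    $pos = $m[0][1] + strlen($m[0][0]);
}
print_r($found);
```

Each iteration finds the next occurrence only, because the fifth argument tells preg_match where to start looking.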

Note that I avoid .*?, since the dot matches any character (and because you used the s option, the dot also matches line breaks). Although .*? looks like a simple solution that "works", it has its price.

Because .* matches any character, it can go beyond the current tag (for example, if a tag has name but no value). Example:

$html = <<<HTML
<input type="hidden" name="semvalue">
<input type="hidden" value="abc">
HTML;

if (preg_match_all('/name="(.*?)" value="(.*?)"/is', $html, $matches)) {
    var_dump($matches);
}

With the s option, the dot also matches line breaks. So the regex looks for name=" and then keeps consuming characters in the string (including line breaks) until it finds " value=". It ends up "invading" another tag, and the result is:

array(3) {
  [0]=>
  array(1) {
    [0]=>
    string(49) "name="semvalue">
<input type="hidden" value="abc""
  }
  [1]=>
  array(1) {
    [0]=>
    string(30) "semvalue">
<input type="hidden"
  }
  [2]=>
  array(1) {
    [0]=>
    string(3) "abc"
  }
}

Another thing .*? can do is go back and forth in the string, checking all the possibilities, which makes the regex less efficient, especially over a stretch that does not match the expression. With [^"], at least you guarantee that the regex stops when it finds the " character, which greatly reduces this inefficiency, besides ensuring that it will not "invade" other tags.
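
As a rough comparison, the sketch below runs both variants against the two-tag example from before. The second pattern is only an illustration of swapping .*? for [^"]* in the regex from the question; it still requires name and value to be adjacent, so it is not a complete solution:

```php
<?php
$html = <<<HTML
<input type="hidden" name="semvalue">
<input type="hidden" value="abc">
HTML;

// Lazy dot with the s option: free to cross into the next tag.
$lazy = preg_match_all('/name="(.*?)" value="(.*?)"/is', $html, $m1);
// Negated class: each capture must stop at the next double quote.
$negated = preg_match_all('/name="([^"]*)" value="([^"]*)"/i', $html, $m2);

echo "lazy dot: $lazy match(es)", PHP_EOL;
echo "negated class: $negated match(es)", PHP_EOL;
```

Here the lazy version reports a bogus match spanning both tags, while the negated class reports none, which is the correct answer for this input.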

This is an important detail: a regex works much better when you say exactly what you want and what you don't want. In this case, I don't want "anything" (.*). Inside a tag, for example, what I actually want is "anything other than the closing of the tag", that is, any character other than >. That's why I used:

<input\b([^>]*)>

The excerpt [^>]* means "zero or more characters other than >". Everything between [^ and ] is a negated character class, which matches anything that is not listed inside it. Since only > is inside, this excerpt matches everything that is not the closing of the tag. This includes line breaks, which makes the s option unnecessary. But I kept the i option, because tag names are case insensitive (that is, the regex matches both input and INPUT).

And since this excerpt is in parentheses, it forms a capture group, which I can retrieve with $match[1][0]. $match[1] refers to the first capture group (as it is the first pair of parentheses in the regex), and $match[1][0] contains the string captured by the regex. $match[1][1], in turn, contains the position at which the match was found (this information is only available when you pass the PREG_OFFSET_CAPTURE option).
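
The following sketch applies that first regex to a made-up snippet with an upper-case tag split over two lines, to show that neither the line break nor the letter case gets in the way. It uses preg_match_all instead of the loop from the answer only to keep the example short; the `$html` content is hypothetical:

```php
<?php
// Hypothetical snippet: one upper-case tag split over two lines, one lower-case tag.
$html = "<INPUT name=\"a\"\n value=\"1\"><input name=\"b\">";

preg_match_all('/<input\b([^>]*)>/i', $html, $matches, PREG_OFFSET_CAPTURE);

// $matches[1] holds one [captured text, offset] pair per tag found.
foreach ($matches[1] as [$content, $offset]) {
    echo $offset, ': ', var_export($content, true), PHP_EOL;
}
```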

Anyway, this first regex gets the whole content of an input tag. Then I use another regex to fetch the attributes of the current tag, searching within the content of the tag found rather than the entire HTML. This restricts the search space and prevents the regex from accidentally looking at other tags (which can happen, depending on the case, if I use .*, for example).

The excerpt (type|name|value) matches the string "type", "name" or "value" (the | character means or), and it is in parentheses to form a capture group, so I can get the string that was captured. Then we have zero or more whitespace characters (\s*), an equals sign, zero or more whitespace characters again, and then 3 options:

a string delimited by double quotes: "([^"]*)" or
a string delimited by single quotes: \'([^\']*)\' (since name='nome' is allowed) or
zero or more characters other than single quotes, double quotes, > or whitespace: ([^\'">\s]*) (since we can have things like required or name=nome)

Note that the contents are also in parentheses, which forms another capture group. In the case of the values in quotes, I left the quotes themselves outside the capture group, so I guarantee that I will only have the value of each attribute.

Since there is more than one possibility (double quotes, single quotes or no quotes), I use (?| to indicate a branch reset, so whatever is captured is always group 2. Otherwise, one alternative would be group 2, another group 3, and so on, and I would have to check which one is empty to know which was captured; with the branch reset, I only need to check group 2.
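
To illustrate the difference, here is a small sketch comparing a branch reset with a plain non-capturing group. It is a simplified version of the attribute regex that only looks for name and does not capture the attribute name itself, so in this sketch the value is group 1 rather than group 2. The `$with` and `$without` names are mine:

```php
<?php
// Same three alternatives, with and without the branch reset.
$with    = '/\bname\s*=\s*(?|"([^"]*)"|\'([^\']*)\'|([^\'">\s]*))/i';
$without = '/\bname\s*=\s*(?:"([^"]*)"|\'([^\']*)\'|([^\'">\s]*))/i';

foreach (['name="a"', "name='b'", 'name=c'] as $attr) {
    preg_match($with, $attr, $w);
    preg_match($without, $attr, $wo);
    // With the branch reset the value is always group 1;
    // without it, the value lands in group 1, 2 or 3 depending on the quoting.
    echo $attr, ' => ', $w[1], ' | value is in group ', count($wo) - 1, ' without reset', PHP_EOL;
}
```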

Then I check which attribute name was captured (type, name or value) and take the respective value. I update the position for the next preg_match and continue the loop, looking for the next attribute of the tag. When the content of the tag ends, I print the type, name and value found (they will be empty if not found) and go back to the outer while, which continues searching for the next input.

It would even be possible to write a single regex that recognizes an input tag and, at the same time, its type, name and value attributes in any order. But it would be too big (a combination of the two regexes above with a few more alternations to ensure that the attributes can be in any order) and, frankly, too complex; in my opinion it is no longer worth it.

As you can see, it's much easier to use an HTML parser. Your regex may even work for simpler cases, but make the HTML a little more complicated and the complexity of the regex gradually increases, until it becomes impractical.

For example, if we have value="<abc>", the regex of the first while no longer works, because it takes everything between <input and the first > it finds. We would then have to check whether the > is inside quotes, to make sure that it really closes the tag.

Another case is when you have a tag inside comments:

<!-- <input type="hidden" name="test2" value="ops"> -->

The regex above matches this tag too, because it does not check whether the tag is inside a comment. Keep in mind that a comment can span several lines and contain several other tags and content, so checking this with a regex is far from trivial.

The HTML parser takes care of all these special cases for you. The solution with DOMDocument, for example, correctly gets values containing > and ignores comments, without any change to the code. So evaluate whether regex is in fact the best solution for your case, because when it comes to parsing HTML, it is not always.
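
As a quick check of both edge cases, the sketch below feeds DOMDocument and the first regex the same made-up HTML, containing a commented-out input and a value with > inside it. The input named "real", the `$fromDom` and `$fromRegex` variables and the `libxml_use_internal_errors` call are my own additions for illustration:

```php
<?php
// Hypothetical input combining both edge cases discussed above.
$html = <<<HTML
<html>
<body>
<!-- <input type="hidden" name="test2" value="ops"> -->
<input type="hidden" name="real" value="<abc>">
</body>
</html>
HTML;

libxml_use_internal_errors(true); // keep parser warnings out of the output
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$fromDom = [];
foreach ($doc->getElementsByTagName('input') as $input) {
    $fromDom[$input->getAttribute('name')] = $input->getAttribute('value');
}

// The first regex from the answer, for comparison.
preg_match_all('/<input\b([^>]*)>/i', $html, $fromRegex);

print_r($fromDom);
print_r($fromRegex[1]);
```

The parser should return only the real input with its full value, while the regex also picks up the commented-out tag and cuts the second tag short at the first >.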