Fetching an attribute of one tag returns the value from another tag

function returnvalue($namearr,$exec){
    $result="";
    foreach($namearr as $arr => $value) {
        $regex = '<.* name="'. $value .'".*>';
        if (preg_match($regex,$exec,$result1) === false) {
            $result4=" FALSO OFFSET MORE HIGHER THAN SUBJECT " . "<br />";
        } else if (preg_match($regex,$exec,$result1) == 0) {
            $result4=" VOID" . "<br />";
        } else  {
            preg_match('/value="[\w]+"/i',$result1[0],$result2);
            $result3 = preg_replace('/value="/i',"",$result2[0]);
            $result4 = substr($result3,0,-1);
        }
        $result.= $value." : ".$result4 . "<br />";
    }
return $result;
}

$exec receives: <input name="fb_dtsg" value="AQHF4anP9ASw" autocomplete="off" type="hidden">

$namearr receives: fb_dtsg

All the code above does is return the value of an input. I tested it with other names (putting another name in place of fb_dtsg) and it returned the correct value.

In this case, though, it returns something like mag_glass, when the expected result is AQHF4anP9ASw. What comes back is completely different and, as far as I can tell, does not even exist in $exec.

I believe it comes from preg_match, but I searched for more about mag_glass and found nothing.

Note 1: $exec receives more than what I showed here; to keep things organized I included only the part that matters.

Note 2: I tested the regex by hand in RegExr to check that it reaches the correct value, and it does; only in this function, and only for the name fb_dtsg, is the value different.

  • That mag_glass must exist somewhere else in the HTML text you are analyzing. Can you create a working example of this? On Ideone, for instance.

  • Actually yes, it would be this: <input name="init" id="init" value="mag_glass" type="hidden">, but I ruled that out, since it does not fit my regex.

  • My regex is <.* name="fb_dtsg".*>, so I see no way for it to fit, since the name has to be the one defined in the regex.

  • It must be passing through one of the ifs or the else.

  • If it passed through any if or else, it would print another value. mag_glass does in fact exist as a value in the document, but look at the regex I posted above: would it make sense for it to pick up mag_glass as the value?

  • Maybe the problem is the .*, which tries to match anything regardless of size. Why not just use name="fb_dtsg"? In my view the .* and the <> are unnecessary.

  • I need to get the value of the input with name="x", you see? So I take the entire input that has the name "x" and then process it to get to the value. What you said makes sense, but the regex starts from <, consumes anything until it reaches name="fb_dtsg", and then continues; if that name does not exist, the regex returns 0, since nothing matched what I wrote. I may be wrong, but that is how I remember it. Thanks for your help so far, @Qmechanic73! I will go over the comments and see if I reach a conclusion.

  • Following the comments, I looked into what you said about .*: it really was producing an unexpected result. I tried the regex and was amazed to see it grab a gigantic piece of HTML that ran up to the name fb_dtsg. The value came from another input because the second regex, which captures the value, stopped at the first value= in that chunk (I expected to hand the second regex a single input, not a large part of the page's HTML). Thanks for the tip, I solved the problem!

  • Great =). Will you replace the .* with some other expression? If possible, post an answer describing how you arrived at the solution.

  • The purpose of the script is to capture the value of an input with any name, so I came up with this regex: <input[\w "\.;:,\{\}_&\$%"=-]* name="'. $value .'"[\w "\.;:,\{\}_&\$%"=-]*>, where $value is the name of the input whose value attribute I want. This regex meets my needs, although ordinary HTML will hardly ever have characters like :,& inside an input. Thanks!

  • But did you reach a solution? If not, please post a snippet of the HTML if possible. Using regex in this case can give very inaccurate results. I could try to post an answer using a parser; the code would be a bit more readable.

  • Yes, I did. And yes, please do suggest the parser approach; I only have a vague memory of it, and it would be valuable to me if it is more precise.


1 answer

There are two solutions here: use regex (more complicated and less advisable) or an HTML parser (simpler and more suitable). Both are detailed below.


Regex

The full HTML was not posted, but based on the comments above I was able to reproduce the problem with this example:

$namearr = array('name' => 'fb_dtsg');
$html = '<input name="init" id="init" value="mag_glass" type="hidden"><input name="fb_dtsg" value="AQHF4anP9ASw" autocomplete="off" type="hidden">';
echo returnvalue($namearr, $html);

Note that the HTML has two input tags: the first has value="mag_glass" and the second has name="fb_dtsg".

What happens is that you first use preg_match to check whether there is any name equal to "fb_dtsg". In this case there is, so the code falls into the final else. Inside that else you call preg_match again to look for the value. The problem is that preg_match by default always starts searching from the beginning of the string it is given (see the documentation, especially the part about the offset parameter).

That is, your function follows this logic:

  • is there name="fb_dtsg" somewhere in the string?
  • if so, look for value="..." somewhere in the string

The problem occurs when there is more than one input, because the two steps above are independent: each search scans from the beginning of the text it receives for the snippet it wants, without knowing whether that snippet belongs to the same input the other search found. Here the second search runs on $result1[0], and because .* is greedy, that match starts at the beginning of the line and can span several tags.
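The two independent steps can be reproduced in isolation. The following is only a minimal sketch, assuming the two-input sample string from the example above; the variable names $step1, $step2 and $containsFirstTag are illustrative and not part of the original function.

```php
<?php
// Sample input: two <input> tags on the same line, as in the example above.
$html = '<input name="init" id="init" value="mag_glass" type="hidden">'
      . '<input name="fb_dtsg" value="AQHF4anP9ASw" autocomplete="off" type="hidden">';

// Step 1: the original pattern. The outer < and > act as delimiters,
// so the effective regex is: .* name="fb_dtsg".*
preg_match('<.* name="fb_dtsg".*>', $html, $step1);

// The greedy .* starts at the beginning of the line, so the match
// contains BOTH tags, not just the one named fb_dtsg.
$containsFirstTag = strpos($step1[0], 'name="init"') !== false;

// Step 2: the value search runs on that whole chunk and stops at the
// first value="..." it sees, which belongs to the other input.
preg_match('/value="[\w]+"/i', $step1[0], $step2);

var_dump($containsFirstTag); // bool(true)
echo $step2[0], PHP_EOL;     // value="mag_glass"
```

Nothing in step 2 ties the value to the tag that carried the requested name, which is why the wrong input's value comes back.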


So one way to solve the problem would be:

function returnvalue($namearr,$exec){
    // Accumulated output; kept separate from the preg_match return value
    $output = "";
    foreach($namearr as $arr => $name) {
        $reg_name = 'name="'. $name .'"';
        $reg_value = 'value="([^"]+)"';
        // Branch reset (?|...) keeps the value in capture group 1, whichever attribute comes first
        $regex = "/<input[^>]+(?|{$reg_name}[^>]+{$reg_value}|{$reg_value}[^>]+{$reg_name})/";
        // Call preg_match once and branch on its return value
        $found = preg_match($regex, $exec, $matches);
        if ($found === false) {
            $value =" FALSO OFFSET MORE HIGHER THAN SUBJECT " . "<br />";
        } else if ($found == 0) {
            $value =" VOID" . "<br />";
        } else {
            $value = $matches[1];
        }
        $output.= $name." : ".$value . "<br />";
    }
    return $output;
}

$namearr = array('name' => 'fb_dtsg');
$html = '<input name="init" id="init" value="mag_glass" type="hidden"><input name="fb_dtsg" value="AQHF4anP9ASw" autocomplete="off" type="hidden">';
echo returnvalue($namearr, $html);

First I created two sub-expressions, one for the name and one for the value. For the name, the expression is name="name_passed_to_the_function". In the case above, it is literally name="fb_dtsg".

For the value, the expression is value="([^"]+)". That is, the string value=", followed by [^"]+. The [^"] part is a negated character class, which here means any character other than ". The + quantifier means "one or more occurrences". In other words, value=" must be followed by one or more characters that are not quotation marks, so I take everything in there up to, but not including, the closing quotation mark.

This part is in parentheses, which form a capture group whose value I can retrieve afterwards, as I did with $matches[1]: it is the first pair of parentheses in the regex, hence the first capture group, hence index 1.

Then I assemble the full regex. Note that in PHP a regex must have delimiters (here they are slashes, but other characters are also accepted). Your regex did not use slashes, which means the characters < and > were acting as the delimiters and were not part of the regex itself. In this case it "worked" by coincidence, but it is important to remember that delimiters only mark where the regex begins and ends; they are not part of the expression.

One detail is that the regex explicitly contains <input, because that is the tag you want to capture. When you use .*, you are saying the regex can take anything, because that expression means "zero or more occurrences of any character" (except line breaks, by default).

The regex uses alternation (the | character, which means "or") to cover two possibilities: the name may come either before or after the value (since in HTML these attributes can appear in any order). So the general structure of the regex is:

  • the string <input, followed by [^>]+ (one or more characters other than >, which guarantees the regex will not "invade" other tags)
  • next comes the alternation, in the form (?|name[^>]+value|value[^>]+name). That is, the name may come either before or after the value, and between them there is [^>]+ to make sure the regex does not leave the tag
  • I use (?| to indicate a branch reset. I did this because the value expression has a capture group and, since it appears twice in the regex, it would otherwise create two groups (and I would need an if to find out whether group 1 or group 2 matched). But since each group is in a different alternative (and only one of them can match at a time), I can use the branch reset so that the value is always in group 1.

The output is:

fb_dtsg : AQHF4anP9ASw<br />
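To see what the branch reset group changes, here is a small sketch that runs the same alternation twice, once with an ordinary non-capturing group (?:...) and once with (?|...). The two sample tags and the variable names ($plain, $reset, and so on) are made up for illustration and are not part of the function above.

```php
<?php
// Two illustrative tags with the attributes in opposite order (made-up sample data).
$nameFirst  = '<input name="fb_dtsg" value="AQHF4anP9ASw" type="hidden">';
$valueFirst = '<input value="AQHF4anP9ASw" name="fb_dtsg" type="hidden">';

$reg_name  = 'name="fb_dtsg"';
$reg_value = 'value="([^"]+)"';

// Ordinary non-capturing group: the two value groups are numbered 1 and 2.
$plain = "/<input[^>]+(?:{$reg_name}[^>]+{$reg_value}|{$reg_value}[^>]+{$reg_name})/";
// Branch reset group: numbering restarts in each alternative, so both are group 1.
$reset = "/<input[^>]+(?|{$reg_name}[^>]+{$reg_value}|{$reg_value}[^>]+{$reg_name})/";

preg_match($plain, $valueFirst, $plainMatches);
preg_match($reset, $nameFirst, $resetNameFirst);
preg_match($reset, $valueFirst, $resetValueFirst);

var_dump($plainMatches[1]);    // string(0) "" (group 1 did not take part in the match)
var_dump($plainMatches[2]);    // string(12) "AQHF4anP9ASw"
var_dump($resetNameFirst[1]);  // string(12) "AQHF4anP9ASw"
var_dump($resetValueFirst[1]); // string(12) "AQHF4anP9ASw"
```

Without the branch reset, the caller would have to check which of the two groups is non-empty; with it, index 1 is enough in both attribute orders.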

Note also that you do not need to call preg_match twice in a row, as you did in the first two ifs. Call it once, store the return value, and that is it.

Another detail is that this regex only matches if the value is non-empty. If you also want to handle value="", just change the + to *:

$reg_value = 'value="([^"]*)"';

The asterisk means "zero or more occurrences".
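The difference between the two quantifiers can be checked with a short sketch. The sample tag with an empty value is made up for illustration, and only the value sub-expression is tested here rather than the full regex.

```php
<?php
// Illustrative input with an empty value attribute (made-up sample data).
$html = '<input name="fb_dtsg" value="" type="hidden">';

// + requires at least one character between the quotes, so this does not match.
$plus = preg_match('/value="([^"]+)"/', $html, $plusMatches);

// * also accepts zero characters, so this matches and captures an empty string.
$star = preg_match('/value="([^"]*)"/', $html, $starMatches);

var_dump($plus);           // int(0)
var_dump($star);           // int(1)
var_dump($starMatches[1]); // string(0) ""
```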


HTML parser

Using regex to parse HTML is not ideal (read more about this here). For simpler cases it may even work (although in this case the HTML is simple but the regex is not), but here I find it much simpler to use an HTML-specific API, such as DOMDocument. Here is how it would look to find the inputs:

$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$name = "fb_dtsg";
foreach($xpath->query('//input[@name="'. $name. '"]') as $input){
    echo $input->getAttribute('value');
}

In this case, it searches for all input fields that have the given name and takes their respective values.
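If the goal is to look up several names at once, as the original returnvalue function does, the same idea could be wrapped in a helper. This is only a sketch: the function name inputValuesByName, its array return shape, and the choice to return null for a missing name are assumptions made for illustration, and it requires PHP's DOM extension to be available.

```php
<?php
// Hypothetical helper (not from the original answer): maps each requested
// input name to the value of the first matching input, or null if none exists.
function inputValuesByName(array $names, string $html): array
{
    // Real pages are rarely valid HTML; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument;
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $values = [];
    foreach ($names as $name) {
        // Assumption: $name contains no double quotes, so it can be
        // embedded in the XPath string literal as-is.
        $nodes = $xpath->query('//input[@name="' . $name . '"]');
        $first = $nodes->item(0);
        $values[$name] = $first instanceof DOMElement ? $first->getAttribute('value') : null;
    }
    return $values;
}

$html = '<input name="init" id="init" value="mag_glass" type="hidden">'
      . '<input name="fb_dtsg" value="AQHF4anP9ASw" autocomplete="off" type="hidden">';
$values = inputValuesByName(['fb_dtsg', 'init', 'missing'], $html);
print_r($values);
```

Returning an array instead of a pre-formatted string leaves the "<br />" formatting to the caller, which is a design choice of this sketch rather than something the original code does.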

This solution is not only simpler, it also handles several cases that the regex above lets through.

For example, if the input is inside a comment:

<!--
<input name="fb_dtsg" value="AQHF4anP9ASw" autocomplete="off" type="hidden">
-->

In this case the regex still finds a match (see), whereas DOMDocument correctly ignores comments. And writing a regex to detect whether the input is inside a comment is quite complicated (you would have to combine the regex above with something similar to that, for example) and, in my opinion, not worth it.
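The comment case can be checked side by side with a short sketch. It assumes the commented-out input sits inside a form element (the wrapping form is added here only so the parser has a normal element to work with), and it uses a simplified name-before-value version of the regex; the variable names are illustrative.

```php
<?php
// Commented-out input wrapped in a <form> (the wrapper is an assumption for this sketch).
$html = "<form><!--\n"
      . '<input name="fb_dtsg" value="AQHF4anP9ASw" autocomplete="off" type="hidden">' . "\n"
      . "--></form>";

// The regex has no notion of comments, so it still reports a match.
$regexFound = preg_match('/<input[^>]+name="fb_dtsg"[^>]+value="([^"]+)"/', $html, $m);

// The parser sees the commented-out tag as a comment node, not as an element.
libxml_use_internal_errors(true);
$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$domFound = $xpath->query('//input[@name="fb_dtsg"]')->length;

var_dump($regexFound); // int(1)
var_dump($domFound);   // int(0)
```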
