Expression to handle a URL with a parameter


NOTICE

The solutions below will not be of much use on their own: you must already have friendly URLs implemented, as well as the checks that decide which content the system displays. Here I only give examples of full addresses received in the URL. Each specific project should already have its own handling for incoming parameters, so combine it with mine or build on what the functions can offer. But after all, what was the goal?

If your system fails to load data when it is asked for something like "meu-site//an*i#maç!ao" and answers with something like "page not found..."

Great! The function will help and will try to redirect, as long as the server itself does not already block the request with a 400 error.

OBJECTIVES:

1) When there is no search parameter via GET (the "?" type):

meu-site///an*i#maç!ao//##minha-postagem/

becomes:

meu-site/animação/minha-postagem

use:

// remove special characters, allowing only hyphen and slash
$nova_url = preg_replace('/[^A-Za-z0-9-\/]/', '', $url);

// collapse several consecutive slashes into one
$nova_url = preg_replace('/(\/)\1+/', '$1', $nova_url);

***** Improvement Request: (it could also remove the slash when it is in the last position) *****
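A minimal sketch of how the two steps above and the requested improvement (dropping a trailing slash) might be combined. The function name limpa_caminho is hypothetical and not part of the original code. The example input uses a plain "c" because accented letters such as "ç" are outside the allowed character class and would be removed as well.

```php
<?php
// Hypothetical helper (not from the original post): cleans the path part of a URL.
function limpa_caminho(string $url): string
{
    // keep only letters, digits, hyphen and slash
    $nova_url = preg_replace('/[^A-Za-z0-9\-\/]/', '', $url);

    // collapse runs of slashes into a single slash
    $nova_url = preg_replace('/\/{2,}/', '/', $nova_url);

    // drop a slash left in the last position
    return rtrim($nova_url, '/');
}

echo limpa_caminho('meu-site///an*i#ma!cao//##minha-postagem/'), "\n";
// prints: meu-site/animacao/minha-postagem
```

rtrim is used for the last step because it only touches the end of the string, so slashes between segments are preserved.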

2) When I have only the search parameter via GET of type "?" without paging:

meu-site/@#$%%@?sWXX=palavra

becomes:

meu-site/?s=palavra

Implemented (function created and explained below)

 if (preg_match('/^.*?\?.*?s.*?=$/', $str)) { return "?s="; }
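The expression above only matches when the string ends right at the "=". As a hedged sketch, a variant with one extra capture group could keep the searched word. The function name ajusta_busca is hypothetical, and the sketch assumes that only the query part is returned, with the "meu-site/" prefix handled elsewhere.

```php
<?php
// Hypothetical variant (assumption): keeps whatever comes after the "=".
function ajusta_busca(string $str): ?string
{
    // some "?", later some "s", later "=", then capture the rest of the string
    if (preg_match('/^.*?\?.*?s.*?=(.*)$/', $str, $matches)) {
        return "?s={$matches[1]}";
    }
    return null; // not in the expected format
}

echo ajusta_busca('meu-site/@#$%%@?sWXX=palavra'), "\n";
// prints: ?s=palavra
```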

3) Now when I have paging and search parameter together

 meu-site/!anim a cao/!!!page/@#$%%@?sWXX=palavra

becomes:

meu-site/animacao/page/2?s=palavra

Implemented (function created and explained below)

 if (preg_match('{^((?:[a-zA-Z][^a-zA-Z]*)+)/.*?p.*?a.*?g.*?e.*?/.*?(\d+).*?\?.*?s.*?=(.*)$}', $str, $matches)) {
     $trecho1 = preg_replace('/[^a-zA-Z]/', '', $matches[1]);
     return "{$trecho1}/page/{$matches[2]}?s={$matches[3]}";
 }

Anyway, I know there are many possible inputs, but as I said, these expressions help a URL typed "erroneously" reach the intended content and, incidentally, may even reduce possible content duplication. This is what I needed, but if you can improve it, anyone who wants to help is welcome to modify my functions, comment, and help improve the code!

THE REQUEST AND EXPLANATIONS FOLLOW

I need a PHP regular expression that preserves my search parameter. The parameter on my site is called s (short for search, as on most websites), so when I run a search on the site via GET the browser shows it like this:

meu-site/?s=

Basically, if there is anything before the ?s= parameter or anything in the middle of it, I need preg_replace to remove it and return the clean string shown above.

So, here is what I need the expression to do:

First expression

  1. If there is no parameter in the string other than "?s=", do nothing when it is already like this:

    ?s=
    

Otherwise, remove everything in between (what comes after is not needed; the browser understands it as the search word) and everything before it! Allow only the characters a-z, the digits 0-9, single (non-repeated) slashes, and the hyphen, like the expressions I made below. Example:

Input:

a) ???!!?s=

b) ///()!?s=

c) !?s!=

Output:

?s=

Second expression

  1. Now, if the URL has more parameters than s (don't worry, I already check all of that), just interpret the input string.

NOTE: for paging, I don't even use the slash after the page number...

animacao/page/2?s=

Input:

1) animacao/!pa*ge)*&/2?s=

2) animacao////!?page))/2?s!=

3) animacao/!?pa_ge))/!@2?s*!=

Output:

animacao/page/2?s=

The expressions I made work well across the whole site, but only when there is no search parameter via GET such as "?s=". This is what I use for the cleanup:

<?php
// remove special characters, allowing only / and hyphen
$url = preg_replace('/[^A-Za-z0-9-\/]/', '', $url);

// collapse several consecutive slashes
$url = preg_replace('/(\/)\1+/', '$1', $url);

// collapse several consecutive question marks
$url = preg_replace('/([?])\1+/', '$1', $url);
?>

Update of the request in response to the comments

The function that user "hkotsubo" offered matches what I need. After reading his excellent explanation, I tested it and identified a possible improvement.

The value the expression captures in "$matches[1]", which I believe is the word "animacao" in this case (I may be wrong), does not work when there are spaces before, between or after the word, or special characters.

What I need:

If possible, allow hyphens, and remove spaces or special characters in any position. I think I managed to add the hyphen in the "a-zA-Z-" part, but the spaces still need handling.

Input (with spaces):

1) ani mac ao /page/2?s=

2) a!!@*ni-ma-cao!/page/2?s=

Output:

animacao/page/2?s=
  • Doubts: sQ!= turned into s=, so why does the Q disappear? Shouldn't it be sQ=? Or can there only be a parameter called s? And why did ![]}page!)///2?s= turn into page/2?s=, while !!@SSDwe?s= turned into ?s=? What is the criterion for page to stay and SSDwe to disappear? Can there only be s, or can there be ?s=a&xyz=abc&etc=123....? Anyway, not all the criteria were clear to me...

  • I had the same doubt as @hkotsubo... I think you could give better examples. For instance: if the URL looks like this: meu-site/!!@SSDwe?s= I want it to end up like this: meu-site/?s= ... if it comes in "this way", I want it to end up "that way", and so on...

  • Yes, I'll explain it to you...

  • Edited; ask me if anything is unclear.

2 answers



I don't know if I understood all the rules exactly, and maybe I'm oversimplifying the problem, but here goes:


Parameter s

For the parameter s, you want the end result to be the string ?s=, and you want to delete all other characters that are not part of this string.

So you don't actually need fancy replacements: just check whether there is a ? somewhere in the string, then whether there is an s after it, and whether there is a = at the end:

function ajusta_parametro($str) {
    // if there is some "?", then some "s", and it ends with "="
    if (preg_match('/^.*?\?.*?s.*?=$/', $str)) {
        return "?s=";
    }
    // if it is not in the correct format, return '' or raise an error, you decide
    // (an empty string is returned here as one of those options)
    return '';
}

I used the markers ^ (start of the string) and $ (end of the string) to ensure that the whole string has the format I want. I then used .*? (zero or more occurrences of any character). Remember that .* is greedy and always takes as many characters as possible, but placing a ? after it cancels this behavior, so the regex looks for the shortest possible sequence (in this specific case I don't think it makes much difference; maybe it is slightly faster because it doesn't need to grab the largest possible sequence first).

Then I have \? (the "question mark" character itself), followed by zero or more characters, followed by s, followed by zero or more characters, followed by = and the end of the string. That is, the regex checks whether there is a ? in the string, then an s after it (with or without anything in between), and then a = at the end (with or without anything before it).

If the string has this format, simply return ?s=. Otherwise, you decide whether to return something (empty string maybe) or show an error message.

With this, all the strings below return ?s=:

echo ajusta_parametro('???!!?s='), "\n";
echo ajusta_parametro('///()!?s='), "\n";
echo ajusta_parametro('!?s!='), "\n";
echo ajusta_parametro('!?sQ!='), "\n";
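To make the greedy versus lazy distinction mentioned above more concrete, here is a small sketch. The test string and the variable names are illustrative assumptions, not part of the original answer.

```php
<?php
$str = 'a?b?c';

// greedy: (.*) grabs as much as it can, so it stops at the LAST "?"
preg_match('/^(.*)\?/', $str, $greedy);

// lazy: (.*?) grabs as little as it can, so it stops at the FIRST "?"
preg_match('/^(.*?)\?/', $str, $lazy);

echo $greedy[1], "\n"; // prints: a?b
echo $lazy[1], "\n";   // prints: a
```

In ajusta_parametro both forms accept the same strings, because nothing is captured and the pattern is anchored at both ends; the difference only shows up when a group's content matters.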

Paging

For pagination, we can use similar reasoning. From what I saw of the format animacao/page/2, I'm going to assume it is:

  • any text (sequence of letters)
  • /page/number(s)

For the text, we can use [a-zA-Z]+. page will always be fixed and I use something similar to what I did for the parameter s (see if there’s a p, followed by anything, followed by a, etc). And for the number, I use \d+ (one or more digits).

Another detail is that regex examples in PHP usually use /regex/ (with slashes delimiting the expression). If I want to put a slash inside the expression, I have to write it as \/. But I can also change the delimiters to something else, and then I don't need to escape the slash with \. The regex looks like this:

function ajusta_page($str) {
    // if there is some text/page/number, possibly with several characters between them
    // I used { } to delimit the regex (instead of / /)
    if (preg_match('{^([a-zA-Z]+)/.*?p.*?a.*?g.*?e.*?/.*?(\d+).*?$}', $str, $matches)) {
        return "{$matches[1]}/page/{$matches[2]}";
    }
    // if it is not in the correct format, return '' or raise an error, you decide
}

In that case, I put [a-zA-Z]+ and \d+ in parentheses because this forms a capture group, whose content is placed in the variable $matches. Thus, $matches[1] corresponds to the first pair of parentheses (in this case [a-zA-Z]+, the page name), and $matches[2] corresponds to the second pair (\d+, the page number). I put it all together and the result is "name/page/number".

All cases below print animacao/page/2:

echo ajusta_page("animacao/!page)&/2"), "\n";
echo ajusta_page("animacao////!?page))/2"), "\n";
echo ajusta_page("animacao/!?pa_ge))/!@2"), "\n";
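On the delimiter detail discussed above, a short sketch showing that the same pattern can be written with / / (escaping each slash) or with { } (no escaping needed). The simplified pattern and the variable names are assumptions made for illustration only.

```php
<?php
$str = 'animacao/page/2';

// delimited by slashes: every literal "/" inside must be escaped
$with_slashes = preg_match('/^[a-zA-Z]+\/page\/\d+$/', $str);

// delimited by braces: "/" is an ordinary character
$with_braces = preg_match('{^[a-zA-Z]+/page/\d+$}', $str);

var_dump($with_slashes === 1 && $with_braces === 1); // bool(true)
```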

Now just put it all together

Joining the two regexes into one, we have:

function ajusta($str) {
    if (preg_match('{^([a-zA-Z]+)/.*?p.*?a.*?g.*?e.*?/.*?(\d+).*?\?.*?s.*?=$}', $str, $matches)) {
        return "{$matches[1]}/page/{$matches[2]}?s=";
    }
    // if it is not in the correct format, return '' or raise an error, you decide
}

All cases below result in animacao/page/2?s=:

echo ajusta("animacao/!page)&/2???!!?s="), "\n";
echo ajusta("animacao////!?page))/2///()!?s="), "\n";
echo ajusta("animacao/!?pa_ge))/!@2!?s!="), "\n";
echo ajusta("animacao/!?pa_ge))/!@2!?sQ!="), "\n";

If you want to allow more things after the = (whatever they are), just add .* after it. But if you want to include those things in the resulting string, use another capture group:

function ajusta($str) {
    // if there is some text/page/number, possibly with several characters between them
    if (preg_match('{^([a-zA-Z]+)/.*?p.*?a.*?g.*?e.*?/.*?(\d+).*?\?.*?s.*?=(.*)$}', $str, $matches)) {
        return "{$matches[1]}/page/{$matches[2]}?s={$matches[3]}";
    }
    // if it is not in the correct format, return '' or raise an error, you decide
}

So a string like animacao/!?pa_ge))/!@2!?sQ!=valor&param=outrovalor turns into animacao/page/2?s=valor&param=outrovalor.


Initial snippet with special characters

In the case added to the question later, the first part ("animacao") can also have several non-alphanumeric characters between the letters. If the final text is always "animacao", just use an approach similar to what was done with "page": put the letter a, followed by .*?, followed by n, followed by .*?, and so on.

But if it can be any text, we can use ((?:[a-zA-Z][^a-zA-Z]*)+). Explaining from the inside out:

  • (?: creates a non-capturing group. This makes the parentheses not add a new value to $matches. If I used only ( (without the ?:), this would change the indexes of the other parentheses in the expression ($matches[2], $matches[3], etc.): each extra pair of parentheses adds a new index to $matches, and if I don't want that, I just use (?:.
  • [a-zA-Z][^a-zA-Z]* is "a letter followed by zero or more characters that are not letters". All of this is inside the non-capturing group and followed by a +, i.e., that sequence of "letter followed by non-letters" can occur several times.
  • all of this is within parentheses, to form a capture group (in this case, it will be $matches[1]).
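The effect of (?: on the indexes of $matches, described in the list above, can be seen in a small sketch. The pattern and test string are illustrative assumptions, not taken from the answer.

```php
<?php
// capturing group: (a|b) takes index 1 and pushes (\d) to index 2
preg_match('/(a|b)+(\d)/', 'ab1', $with_capture);

// non-capturing group: (?:a|b) takes no index, so (\d) is index 1
preg_match('/(?:a|b)+(\d)/', 'ab1', $without_capture);

print_r($with_capture);    // [0] => ab1, [1] => b, [2] => 1
print_r($without_capture); // [0] => ab1, [1] => 1
```

Note that a repeated capturing group keeps only its last iteration ("b" here), which is why the answer wraps the whole repetition in an outer pair of parentheses to capture the full first segment.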

The rest of the expression (to handle "page/X?s=") is the same. The only detail is that $matches[1] will contain the first part of the URL together with the "non-letter" characters, so I need an additional preg_replace before building the final string:

function ajusta($str) {
    // if there is some text/page/number, possibly with several characters between them
    if (preg_match('{^((?:[a-zA-Z][^a-zA-Z]*)+)/.*?p.*?a.*?g.*?e.*?/.*?(\d+).*?\?.*?s.*?=(.*)$}', $str, $matches)) {
        // strip everything that is not a letter from the first segment
        $trecho1 = preg_replace('/[^a-zA-Z]/', '', $matches[1]);
        return "{$trecho1}/page/{$matches[2]}?s={$matches[3]}";
    }
    // if it is not in the correct format, return '' or raise an error, you decide
}

With this, the cases below print animacao/page/2?s=:

echo ajusta("ani mac ao /!?pa_ge))/!@2!?s!="), "\n";
echo ajusta("a!!@*ni-ma-cao!/!?pa_ge))/!@2!?s!="), "\n";

Final considerations

Although it was possible to solve this with regex, I don't know whether it is really the best approach.

Your program is accepting "anything" and trying to extract a valid string from it. The problem is that there are too many possibilities to consider, and the more cases you add, the more complex (and slower) the regex gets.

If "accept anything and try to extract a valid string" is a core requirement of the system, then there is no way around it: you will have to live with this regex (and maintain it every time a new, more complicated case comes up).

But if you can, maybe a middle ground is better: a regex that validates the most common and/or basic cases but simply fails when the input is too complicated, and then the system shows an error message and tells the user which formats are accepted, for example. This is only a suggestion, since I don't know what your user interface or your requirements will be (it just seems too complicated to try to predict every possibility, but of course it all depends on your use cases).

Anyway, regex is a powerful and useful tool and, in my opinion, very cool to use, but it is not always the best, nor the only, solution to everything.
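A minimal sketch of the "middle ground" suggested above: accept only a strict, well-formed format and report an error otherwise. The function name valida_url, the exact accepted format, and the error message are all assumptions made for illustration.

```php
<?php
// Hypothetical strict validator (assumption): accepts only "name/page/N?s=term".
function valida_url(string $str): ?string
{
    if (preg_match('{^[a-z0-9-]+/page/\d+\?s=.*$}', $str) === 1) {
        return $str; // already well formed, use as is
    }
    return null; // too far from the accepted format
}

$entrada = 'animacao/!?pa_ge))/!@2!?s!=';
$resultado = valida_url($entrada);

// show the accepted format instead of guessing what the user meant
echo $resultado ?? 'Invalid URL. Accepted format: name/page/number?s=term', "\n";
```

The trade-off is that malformed URLs are rejected rather than repaired, in exchange for a pattern that stays short and easy to maintain.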

  • I LOVE YOU, I LOVE YOU!

  • PERFECT! exactly what I wanted, but I didn’t even know where to start!

  • While testing, a question came up: "$matches[1]" does not come back correctly when there is a number, a hyphen, or a space between the words. The spaces can be removed, whether at the beginning, in the middle, or at the end of the parameter. The rest is working normally! Could you modify it, man?

  • @Caiolourençon I suggest you edit the question and put the details of this case that failed, because I did not understand exactly what happened. But unfortunately I won’t be able to see that until after the holiday :-)

  • @Caiolourençon I updated the answer.

  • Dude, no problem. You know a lot! It is working perfectly, and since you explained it step by step I can try to make the modifications I need and look up more information later! You did exactly what I thought was unfeasible in a single regex. I know there are many more possibilities than these, and it is quite complex, so I hope you at least understand me!

  • I read your "Final considerations". The regex you created is not going to be used to prevent attacks or for any kind of system security; on the contrary, it will only help handle users' requests... Could I explain it to you in the chat somehow?

  • @I’ve got a few minutes if you want..

  • @Caiolourençon Try this: https://chat.stackexchange.com/rooms/90639/expressionpara-tratar-url-com-parametro



Based on what you put as examples, I created this commented code:

$url = "animacao/!?page))/!@2?s*!=";

// grab every word made of a-z, A-Z, 0-9, including the _ (underscore) character,
// OR grab the ( ?s ) part
preg_match_all('/(\w|\?s)+/', $url, $partes);

// new URL that will be used
$novaUrl = "";

foreach ($partes[0] as $parte) {
    // strpos returns 0 (falsy) when "?s" is at the start, so compare with false
    if (strpos($parte, "?s") !== false) {
        // if the part being analyzed contains ( ?s ), append it to $novaUrl with a ( = ) at the end
        $novaUrl .= $parte . "=";
        continue; // move on to the next part
    }
    // any other part is appended with a slash at the end
    $novaUrl .= $parte . "/";
}

// if the last character is a slash ( / ), remove it from the string
if (substr($novaUrl, -1) == "/") {
    $novaUrl = substr($novaUrl, 0, -1);
}

// show the result
echo $novaUrl;

See it working on Ideone
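One detail worth keeping in mind when strpos is used as a condition in a loop like the one above: it returns the position of the match, and position 0 is falsy in PHP. The sketch below uses a part that is exactly "?s" (an assumed input, for example from a URL such as "animacao/?s=") to show why a strict comparison with false is the safer check.

```php
<?php
$parte = '?s';

var_dump(strpos($parte, '?s'));           // int(0): found at index 0
var_dump((bool) strpos($parte, '?s'));    // bool(false): 0 is treated as "not found"
var_dump(strpos($parte, '?s') !== false); // bool(true): strict check works
```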

  • Andrei Coelho, I have just tested your function. For basic inputs it works well, but try running it with this string I tested: "an'i-ma-c'a o/pa|ge/! @2? ssssssssssssss*!="

  • Anyway, I'm borrowing examples from it to build other, simpler ones!

  • @Caiolourençon I'll write another, more precise answer later
