Dots and accents in mod_rewrite URLs


3

I noticed that many systems that use mod_rewrite (or an equivalent) do not support URLs with dots or accents: they usually replace a dot (.) with a hyphen (-) and accented characters with their unaccented letters.

My question is: can using dots and/or accents in URLs cause some kind of problem, or is there another reason not to use them?

Is this a matter of SEO or of usability, or are these characters that can get in the way of something, such as navigation or the URL structure?

The rule I intend to use for rewritten URLs:

^([a-zA-Z0-9_\-\/.]+)$

2 answers

3


Searching for the term "rfc dot path", I finally found the reason for this controversy.

The problem with the dot (.) in URLs

There is no problem with using dots in URLs (even rewritten ones), for example:

http://example/project/ola.mundo.novo

or, assuming we create a fake URL, such as:

http://example/project/index.php/ola-mundo.html

The problem occurs when a dot is used like this:

http://example/project/test./

To the server, /project/test./ and /project/test/ are the same thing, even though they are visibly different.

Note that the problem does NOT occur with /project/.test/, since names that start with a dot are valid file names.

The reason rewritten URLs avoid dots is to prevent this situation and to make URL canonicalization (URL normalization) easier.

For a clearer example of the problem, create a file in the document folder of your localhost:

/var/www/images/test.jpg

Access http://localhost/images/test.jpg and then try to access all of these:

  • http://localhost/images/test.jpg.
  • http://localhost/images/test.jpg...
  • http://localhost/images/test.jpg....
  • http://localhost/images/test.jpg.....
  • http://localhost/images/test.jpg......
  • http://localhost/images/test.jpg.......

All of these URLs will be delivered to the client (a browser, for example) as the image test.jpg.

URL normalization (or URL canonicalization)

The English terms to search for are URL normalization or URL canonicalization.

URL normalization (or URL canonicalization) is the process by which URLs are modified and standardized in a consistent manner. The purpose of normalization is to transform a URL into a normalized, or canonical, URL, so that it is possible to determine whether two syntactically different URLs are equivalent.

Search engines use URL normalization to assign importance to web pages and to reduce the indexing of duplicate pages. Crawlers perform URL normalization to avoid crawling the same resource more than once.

Types of normalization (some of the following normalizations are described by RFC 3986):

  • Removing the directory index. Default directory indexes are usually not required in URLs:

    http://www.example.com/a/index.html → http://www.example.com/a/

  • Replacing the IP with a domain name. Make sure the IP address maps to a canonical domain name:

    http://208.77.188.166/ → http://www.example.com/ (the Host: domain header helps with this)

  • Removing duplicate slashes. Paths that include two adjacent slashes can be converted to one:

    http://www.example.com/foo//bar.html → http://www.example.com/foo/bar.html

  • Removing or adding www as the first domain label. Usually both URLs point to the same pages:

    http://www.example.com/ → http://example.com/

  • Removing the ? when the query is empty. When the query is empty, the ? may be unnecessary:

    http://www.example.com/display? → http://www.example.com/display

  • Adding a trailing / to directories:

    http://www.example.com/alice → http://www.example.com/alice/ (servers like Apache and Nginx generally already redirect if it is a real folder).

    However, there is no way to know whether a URL path component represents a directory or not. RFC 3986 notes that if the first URL redirects to the second URL of the example, this is an indication that they are equivalent.

  • Removing dot-segments. The segments .. and . can be removed from a URL according to the algorithm described in RFC 3986:

    http://www.example.com/../a/b/../c/./d.html → http://www.example.com/a/c/d.html

    However, if a removed .. component, e.g. b/.., is a symbolic link to a directory with a different parent, eliding b/.. will result in a different path and URL. In rare cases, depending on the web server, this may even be true for the root directory (e.g. //www.example.com/.. may not be equivalent to //www.example.com/). This is the probable reason to avoid the dot.
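Three of the normalizations listed above (duplicate slashes, dot-segments and the empty query) can be sketched in PHP. This is only a minimal illustration, not a complete RFC 3986 implementation: the function names removeDotSegments and normalizeUrl are hypothetical, the path is assumed to be absolute (starting with /), and userinfo and fragments are ignored for brevity.

```php
<?php
// Hypothetical helper: resolves "." and ".." in an absolute path (one that
// starts with "/"), in the spirit of RFC 3986 section 5.2.4.
function removeDotSegments(string $path): string
{
    $segments = explode('/', $path);
    $output = [];
    foreach ($segments as $segment) {
        if ($segment === '.') {
            continue; // current directory: nothing to keep
        }
        if ($segment === '..') {
            // Go up one level, but never above the root (the leading "")
            if (count($output) > 1) {
                array_pop($output);
            }
            continue;
        }
        $output[] = $segment;
    }
    // A path ending in "." or ".." still refers to a directory: keep the slash
    $last = end($segments);
    if ($last === '.' || $last === '..') {
        $output[] = '';
    }
    return implode('/', $output);
}

// Hypothetical helper: applies three of the normalizations listed above.
function normalizeUrl(string $url): string
{
    $parts = parse_url($url);
    if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
        return $url; // not an absolute URL: leave it untouched
    }
    $path = $parts['path'] ?? '/';
    $path = preg_replace('#/{2,}#', '/', $path); // duplicate slashes
    $path = removeDotSegments($path);            // dot-segments

    $result = $parts['scheme'] . '://' . $parts['host'];
    if (isset($parts['port'])) {
        $result .= ':' . $parts['port'];
    }
    $result .= $path;
    if (($parts['query'] ?? '') !== '') {        // drop an empty "?"
        $result .= '?' . $parts['query'];
    }
    return $result;
}

echo normalizeUrl('http://www.example.com/../a/b/../c/./d.html'), "\n"; // http://www.example.com/a/c/d.html
echo normalizeUrl('http://www.example.com/foo//bar.html'), "\n";        // http://www.example.com/foo/bar.html
echo normalizeUrl('http://www.example.com/display?'), "\n";             // http://www.example.com/display
```

Comparing two URLs after passing both through the same function is what lets an application treat them as one resource; which normalizations are safe to apply depends on the server, as the symbolic link caveat above shows.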

So you ask me: William, must I then avoid dots in my rewritten URLs? I say that is one solution, but not the only one. If you are using mod_rewrite, you are probably also using a language like PHP, and through that language you can detect whether the URL has dots at the end, for example:

<IfModule mod_rewrite.c>
    RewriteEngine On

    RewriteCond %{REQUEST_FILENAME} !-f
    RewriteCond %{REQUEST_FILENAME} !-d

    RewriteRule ^([a-zA-Z0-9\-\/.]+)$ index.php/$1 [QSA,L]
</IfModule>

This RewriteRule generates the variable $_SERVER['PATH_INFO'], which you can compare with $_SERVER['REQUEST_URI']; the two will be different. Or you can simply use REQUEST_URI combined with rtrim to check the URL and issue a permanent redirect, for example:

<?php
$req = rtrim($_SERVER['REQUEST_URI'], '/'); // Remove the trailing slash

if ($req !== rtrim($req, '.')) {
    // X-PHP-Response-Code is used for compatibility with some FastCGI or similar servers
    header('X-PHP-Response-Code: 301', true, 301);
    // Send the client to the same path without the trailing dots
    header('Location: ' . rtrim($req, '.'), true, 301);
    exit;
}

I believe this can also be done in .htaccess; as soon as I can produce something efficient I will edit the answer.

The problem with accents in URLs

The main reason to avoid accents is also canonicalization, but it is not the same problem as the dot (.): the problem comes from characters that are equivalent yet encoded differently, for example:

  • In a PHP document saved in ANSI, á will be encoded as %E1:

      <?php
      echo 'http://example/', urlencode('á-é-í');//Output: http://example/%E1-%E9-%ED
    
  • In a PHP document saved in UTF-8, á will be encoded as %C3%A1:

      <?php
      echo 'http://example/', urlencode('á-é-í');//Output: http://example/%C3%A1-%C3%A9-%C3%AD
    

This is just one example; another would be ß and ss.

There are solutions to this problem, and not using accents is one of them, but there are other ways; as soon as possible I will provide an example.

Unicode canonicalization
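One possible way, sketched here under assumptions, is to convert every slug to UTF-8 before percent-encoding it, so that the same text always produces the same URL. The function name encodeSlugAsUtf8 is hypothetical, and the source encoding is assumed to be known by the caller (it cannot be detected reliably from the bytes alone).

```php
<?php
// Hypothetical helper: converts a slug to UTF-8 before percent-encoding it.
// $sourceEncoding is an assumption: the caller must know the real encoding.
function encodeSlugAsUtf8(string $slug, string $sourceEncoding): string
{
    if (strcasecmp($sourceEncoding, 'UTF-8') !== 0) {
        $converted = iconv($sourceEncoding, 'UTF-8', $slug);
        if ($converted === false) {
            throw new InvalidArgumentException('Slug is not valid ' . $sourceEncoding);
        }
        $slug = $converted;
    }
    return urlencode($slug);
}

$latin1 = "\xE1-\xE9-\xED";             // "á-é-í" as ISO-8859-1 bytes
$utf8   = "\xC3\xA1-\xC3\xA9-\xC3\xAD"; // "á-é-í" as UTF-8 bytes

echo encodeSlugAsUtf8($latin1, 'ISO-8859-1'), "\n"; // %C3%A1-%C3%A9-%C3%AD
echo encodeSlugAsUtf8($utf8, 'UTF-8'), "\n";        // %C3%A1-%C3%A9-%C3%AD
```

Both calls print the same percent-encoded string, which is the point: the link no longer depends on how the PHP file happened to be saved.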

Note: although some points are the same as those described here, the RFC 2279

In Unicode, many accented letters can be represented in more than one way. For example, é can be represented as the character U+0065 (LATIN SMALL LETTER E) followed by the character U+0301 (COMBINING ACUTE ACCENT), but it can also be represented as the precomposed character U+00E9 (LATIN SMALL LETTER E WITH ACUTE). This makes string comparison more complicated, since every possible representation of a string containing such glyphs must be considered. To deal with this, Unicode provides the canonical equivalence mechanism. In this context, canonicalization is Unicode normalization.

Variable-length encodings in the Unicode standard, UTF-8 in particular, may create an additional need for canonicalization in some situations. By the standard, UTF-8 has only one valid byte sequence for any Unicode character, but some byte sequences are invalid, that is, they cannot be obtained by encoding any Unicode string in UTF-8. Some sloppy decoder implementations accept invalid byte sequences as input and produce a valid Unicode character as output. With such a decoder, some Unicode characters effectively have more than one corresponding byte sequence: one valid and some invalid. This can lead to security problems similar to those described in the previous section. Therefore, if you want to apply a filter (for example, a regular expression written in UTF-8) to UTF-8 strings that will later be passed to a decoder that accepts invalid byte sequences, you must canonicalize the strings before passing them to the filter. In this context, canonicalization is the process of translating each character to its single valid byte sequence. An alternative to canonicalization is to reject any string containing invalid byte sequences.
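The two representations can be compared in PHP with a short sketch. It assumes the optional intl extension for the Normalizer class, so the normalization step is guarded and simply skipped when the extension is not installed; the variable names are illustrative only.

```php
<?php
$decomposed  = "e\u{0301}"; // U+0065 followed by U+0301
$precomposed = "\u{00E9}";  // U+00E9

// The two strings render the same glyph but are different byte sequences
var_dump($decomposed === $precomposed); // bool(false)

// Normalizer comes from the optional intl extension, so it may be missing
if (class_exists('Normalizer')) {
    $a = Normalizer::normalize($decomposed, Normalizer::FORM_C);
    $b = Normalizer::normalize($precomposed, Normalizer::FORM_C);
    var_dump($a === $b); // equal once both are in the same normalization form
}
```

Normalizing a slug to one form (NFC here) before storing or comparing it avoids treating two visually identical URLs as different pages.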
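The alternative of rejecting invalid sequences can be sketched in PHP as follows. The function name isValidUtf8 is hypothetical; the check relies on the fact that preg_match() with the u modifier validates its subject and returns false when the bytes are not well-formed UTF-8.

```php
<?php
// Hypothetical helper: true only when $value is well-formed UTF-8.
// With the "u" modifier PCRE validates the subject, and preg_match()
// returns false (instead of 0 or 1) when the bytes are not valid UTF-8.
function isValidUtf8(string $value): bool
{
    return preg_match('//u', $value) === 1;
}

$valid    = "\xC3\xA1"; // "á" in UTF-8
$overlong = "\xC0\xAF"; // overlong (invalid) encoding of "/"
$latin1   = "\xE1";     // "á" in ISO-8859-1, not valid UTF-8 on its own

var_dump(isValidUtf8($valid));    // bool(true)
var_dump(isValidUtf8($overlong)); // bool(false)
var_dump(isValidUtf8($latin1));   // bool(false)
```

Running this check on the request path before any filter or routing rule sees it means an overlong sequence is refused instead of being decoded into a character the filter never saw.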


  • @Leocaracciolo, you have already made it clear that you do not understand constructive criticism and do not take criticism well. Several times I tried to help you reverse the situation when you received downvotes from other users, and you interpreted my help as purely negative criticism. Rest assured, I will not bother you anymore. I will let you think what you want about the downvotes you receive from the community; I will no longer comment on your answers or try to help, because if after all this time you do not understand how it works, you never will. Farewell.

2

Dots and special characters can interfere when your code receives the data. Ideally, use a method such as urlencode or urldecode, which exists in most web languages, so that you can handle these values without errors or lost fields.

  • Thank you for the answer. How could they interfere (both the dot and the accents)? Could you give me an example? Thanks. Note: I don't think urldecode is necessary, because values delivered to the server via GET are already decoded.

  • urldecode will be needed when urlencode was used before.

  • urlencode is needed so that, when creating a link dynamically, you do not run the risk of passing spaces, dots and, mainly, slashes, which will surely break your link. Accents and other special characters may vary according to the charset used, interaction between different languages, and other factors.

  • Michel, I understand the use of urlencode, urldecode, etc. But the generated links should not contain accents or spaces, so urlencode will not help. I actually created a function that ended up like this: $text = iconv($encode, 'ASCII//TRANSLIT//IGNORE', $text); return trim($text, '-'); and it turns a string like -á é í ó . oi Olá Mundo- into a-e-i-o-.-oi-Ola-mundo. I have already solved the charset and encoding part; I just need to determine whether using accents and dots can cause other problems, not in the link itself but in the request, because I noticed that...

  • ...many sites avoid dots and accents. Could you help me?

  • You can even register domains with accents. Registro.br accepts separate registrations for www.requisição.com.br and www.requisicao.com.br, and today browsers handle them without trouble; the question is more about how you receive the request. Many websites use frameworks or tools such as WordPress, Joomla and Blogger which, like development components in every language, already offer ready-made solutions that by convention handle URLs without accents.

  • Michel, thanks for the effort. When I mentioned sites, I meant their pages, not their domains. After searching the RFC I found the reason for this and wrote an answer, which I hope you will read. Thank you.
