Searching for the term "rfc dot path", I finally found the reason for this controversy.
The problem with the dot (.) in URLs
There is no problem with using dots in URLs (even rewritten ones), for example:
http://example/project/ola.mundo.novo
or, assuming we create a fake URL, such as:
http://example/project/index.php/ola-mundo.html
The problem arises when a dot is used like this:
http://example/project/test./
To the server, /project/test./ and /project/test/ are the same thing, even though they are visibly different.
Note that the problem does NOT occur with /project/.test/, since there are files whose names start with a dot.
The reason rewritten URLs avoid dots is to prevent this situation, or to make URL canonicalization (URL normalization) easier.
A clearer example of the problem: create a file in your localhost's physical folder:
/var/www/images/test.jpg
Access http://localhost/images/test.jpg
and then try to access all of these:
http://localhost/images/test.jpg.
http://localhost/images/test.jpg...
http://localhost/images/test.jpg....
http://localhost/images/test.jpg.....
http://localhost/images/test.jpg......
http://localhost/images/test.jpg.......
All of these URLs will be delivered to the client (a browser, for example) as the image test.jpg.
URL normalization (or URL canonicalization)
The English terms to search for are "URL normalization" or "URL canonicalization".
URL normalization (or URL canonicalization) is the process by which URLs are modified and standardized in a consistent way. The purpose of normalization is to transform a URL into a normalized or canonical URL, so that it is possible to determine whether two syntactically different URLs are equivalent.
Search engines use URL normalization to assign importance to web pages and to reduce the indexing of duplicate pages. Crawlers perform URL normalization to avoid crawling the same resource more than once.
Types of normalization (some of these, such as removing dot-segments, are described in RFC 3986; others are common practice and may change which resource the URL points to):
Removing the directory index. Default directory indexes are usually not needed in URLs:
http://www.example.com/a/index.html
→ http://www.example.com/a/
Replacing the IP with a domain name. Check that the IP address maps to a canonical domain name:
http://208.77.188.166/
→ http://www.example.com/
(the Host: domain header helps with this)
Removing duplicate slashes. Paths that include two adjacent slashes can be converted to one:
http://www.example.com/foo//bar.html
→ http://www.example.com/foo/bar.html
Removing or adding www as the first domain label. Usually both URLs point to the same pages:
http://www.example.com/
→ http://example.com/
Removing the ? when the query is empty. When the query is empty, the ? may not be needed:
http://www.example.com/display?
→ http://www.example.com/display
Adding a trailing / to directories:
http://www.example.com/alice
→ http://www.example.com/alice/
(servers such as Apache and Nginx generally already redirect if it is a real folder).
However, there is no way to know whether a URL path component represents a directory or not. RFC 3986 notes that if the first URL redirects to the second, this is an indication that they are equivalent.
Removing dot-segments. The segments .. and . can be removed from a URL according to the algorithm described in RFC 3986:
http://www.example.com/../a/b/../c/./d.html
→ http://www.example.com/a/c/d.html
However, if a removed .. component, e.g. b/.., is a symbolic link to a directory with a different parent, removing b/.. will result in a different path and URL. In rare cases, depending on the web server, this may even be true for the root directory (e.g. //www.example.com/.. may not be equivalent to //www.example.com/). This is the probable reason to avoid the dot.
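To make the purely syntactic normalizations above more concrete (duplicate slashes, empty query, dot-segments), here is a minimal PHP sketch. The function names collapse_slashes_and_empty_query and remove_dot_segments are my own, not part of PHP or of any library. remove_dot_segments is a simplified, stack-based take on the idea described in RFC 3986 and assumes an absolute path that starts with /; it is not a complete implementation of the RFC algorithm.

```php
<?php
// Hypothetical helper: collapses duplicate slashes in the path and drops
// a "?" that is followed by an empty query.
function collapse_slashes_and_empty_query(string $url): string
{
    // Split off the fragment, then the query, so only the path is touched.
    $fragment = '';
    if (($pos = strpos($url, '#')) !== false) {
        $fragment = substr($url, $pos);
        $url = substr($url, 0, $pos);
    }
    $query = '';
    if (($pos = strpos($url, '?')) !== false) {
        $query = substr($url, $pos);
        $url = substr($url, 0, $pos);
    }
    // Collapse runs of slashes, except the "//" right after the scheme's ":".
    $url = preg_replace('#(?<!:)/{2,}#', '/', $url);
    // A lone "?" means the query is empty, so it can be removed.
    if ($query === '?') {
        $query = '';
    }
    return $url . $query . $fragment;
}

// Hypothetical helper: removes "." and ".." segments from an absolute path.
function remove_dot_segments(string $path): string
{
    $segments = explode('/', $path);
    $last = count($segments) - 1;
    $output = [];
    foreach ($segments as $i => $segment) {
        if ($segment === '.' || $segment === '..') {
            // ".." drops the previous segment, but never the leading root.
            if ($segment === '..' && count($output) > 1) {
                array_pop($output);
            }
            // A trailing "." or ".." still refers to a directory: keep the "/".
            if ($i === $last) {
                $output[] = '';
            }
            continue;
        }
        $output[] = $segment;
    }
    return implode('/', $output);
}

echo collapse_slashes_and_empty_query('http://www.example.com/foo//bar.html'), "\n";
// http://www.example.com/foo/bar.html
echo collapse_slashes_and_empty_query('http://www.example.com/display?'), "\n";
// http://www.example.com/display
echo remove_dot_segments('/../a/b/../c/./d.html'), "\n";
// /a/c/d.html
```

Note that none of this touches the file system, so the symbolic link caveat above still applies: the result is only syntactically equivalent.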
So you ask me: William, must I then avoid dots in my rewritten URLs?
I say it is one solution, but not the only one. If you are using mod_rewrite, you are probably also using a language such as PHP, and through that language you can detect whether the URL has dots at the end, for example:
<IfModule mod_rewrite.c>
RewriteEngine On
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule ^([a-zA-Z0-9\-\/.]+)$ index.php/$1 [QSA,L]
</IfModule>
This RewriteRule populates the variable $_SERVER['PATH_INFO'], which you can compare with $_SERVER['REQUEST_URI']; the two will differ. Alternatively, you can just use REQUEST_URI combined with rtrim to check for trailing dots and issue a permanent redirect, for example:
<?php
$req = rtrim($_SERVER['REQUEST_URI'], '/'); // Remove the trailing slash
if ($req !== rtrim($req, '.')) {
    // X-PHP-Response-Code is used for compatibility with some FastCGI or similar servers
    header('X-PHP-Response-Code: 301', true, 301);
    // Send the client to the same URL without the trailing dots
    header('Location: ' . rtrim($req, '.'), true, 301);
    exit;
}
I believe this can also be done in .htaccess; as soon as I can produce something efficient I will edit the answer.
The problem of accents in URLs
The main reason to avoid accents is also canonicalization, but it is not the same problem as the dot (.). Here the problem comes from characters that are equivalent yet different, for example:
In a PHP document saved as ANSI, á will be encoded as %E1:
<?php
echo 'http://example/', urlencode('á-é-í');//Output: http://example/%E1-%E9-%ED
In a PHP document saved as UTF-8, á will be encoded as %C3%A1:
<?php
echo 'http://example/', urlencode('á-é-í');//Output: http://example/%C3%A1-%C3%A9-%C3%AD
This is just one example; another would be ß and ss.
There are solutions to this problem, and not using accents is one of them, but there are other ways; as soon as possible I will provide an example.
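One possible approach, shown here only as a sketch, is to always convert the text to UTF-8 before percent-encoding it, so the same word produces the same URL no matter how the source file was saved. The helper name encode_segment is hypothetical, the mbstring extension is assumed to be installed, and I am assuming that "ANSI" here means ISO-8859-1 (Windows-1252 gives the same bytes for these letters).

```php
<?php
// Hypothetical helper: percent-encodes a path segment as UTF-8 bytes,
// whatever encoding the source string was stored in.
function encode_segment(string $segment, string $sourceEncoding = 'UTF-8'): string
{
    if ($sourceEncoding !== 'UTF-8') {
        // Requires the mbstring extension.
        $segment = mb_convert_encoding($segment, 'UTF-8', $sourceEncoding);
    }
    return rawurlencode($segment);
}

$latin1 = "\xE1-\xE9-\xED";             // "á-é-í" as ISO-8859-1 bytes
$utf8   = "\xC3\xA1-\xC3\xA9-\xC3\xAD"; // "á-é-í" as UTF-8 bytes

echo 'http://example/', encode_segment($latin1, 'ISO-8859-1'), "\n";
// http://example/%C3%A1-%C3%A9-%C3%AD
echo 'http://example/', encode_segment($utf8), "\n";
// http://example/%C3%A1-%C3%A9-%C3%AD
```

This only makes the byte encoding consistent. It does nothing for cases like ß and ss, which need a transliteration rule of your own.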
Unicode canonicalization
Note: although some points are the same as those described here, the RFC 2279
In Unicode, many accented letters can be represented in more than one way. For example, é can be represented as the character U+0065 (LATIN SMALL LETTER E) followed by the character U+0301 (COMBINING ACUTE ACCENT), but it can also be represented as the precomposed character U+00E9 (LATIN SMALL LETTER E WITH ACUTE). This makes string comparison more complicated, since every possible representation of a string containing such glyphs must be considered. To deal with this, Unicode provides the mechanism of canonical equivalence. In this context, canonicalization is Unicode normalization.
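A small sketch of what this means in PHP, assuming the intl extension is installed (it provides the Normalizer class). The function name equal_after_nfc is my own; the idea is simply to normalize both strings to the same form (NFC here) before comparing them.

```php
<?php
// Hypothetical helper: compares two strings after Unicode normalization (NFC).
function equal_after_nfc(string $a, string $b): bool
{
    // Requires the intl extension.
    $na = Normalizer::normalize($a, Normalizer::FORM_C);
    $nb = Normalizer::normalize($b, Normalizer::FORM_C);
    // normalize() returns false on invalid input; treat that as "not equal".
    return $na !== false && $na === $nb;
}

$decomposed  = "e\u{0301}"; // U+0065 followed by U+0301
$precomposed = "\u{00E9}";  // U+00E9

var_dump($decomposed === $precomposed);                // bool(false): the bytes differ
var_dump(equal_after_nfc($decomposed, $precomposed));  // bool(true): canonically equivalent
```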
Variable-length encodings in the Unicode standard, UTF-8 in particular, may create an additional need for canonicalization in some situations. By the standard, in UTF-8 there is only one valid byte sequence for any Unicode character, but some byte sequences are invalid, that is, they cannot be obtained by encoding any Unicode string as UTF-8. Some sloppy decoder implementations accept invalid byte sequences as input and produce a valid Unicode character as output for such a sequence. If someone uses such a decoder, some Unicode characters effectively have more than one corresponding byte sequence: the valid one and some invalid ones. This can lead to security problems similar to those described in the previous section. Therefore, if you want to apply a filter (for example, a regular expression written in UTF-8) to UTF-8 strings that will later be passed to a decoder that allows invalid byte sequences, you must canonicalize the strings before passing them to the filter. In this context, canonicalization is the process of translating each character of the string into its single valid byte sequence. An alternative to canonicalization is to reject any string containing invalid byte sequences.
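The "reject" alternative is the simpler one to sketch in PHP. The example below assumes the mbstring extension is installed; is_valid_utf8 is a hypothetical wrapper name, and the bytes 0xC0 0xAF are an overlong (and therefore invalid) way of writing the slash character.

```php
<?php
// Hypothetical helper: true only if the bytes are well-formed UTF-8.
function is_valid_utf8(string $bytes): bool
{
    // Requires the mbstring extension.
    return mb_check_encoding($bytes, 'UTF-8');
}

$valid    = "/";        // U+002F as the single byte 0x2F
$overlong = "\xC0\xAF"; // overlong two-byte form that a sloppy decoder may read as "/"

var_dump(is_valid_utf8($valid));    // bool(true)
var_dump(is_valid_utf8($overlong)); // bool(false)

// Reject the input before any filter or decoder sees it.
if (!is_valid_utf8($overlong)) {
    echo "Rejected: invalid UTF-8\n";
}
```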
@Leocaracciolo you have already made it clear that you cannot take constructive criticism and do not handle criticism well. Several times I tried to help you when you received downvotes from other users, to turn the situation around, and you took my help as nothing but negative criticism. Rest assured, I will not bother you anymore. I will let you think whatever you want about the downvotes you receive from the community; I will no longer get involved in your answers and I will not help anymore, because if in all this time you have not understood how it works, it means you never will. Farewell.
– Guilherme Nascimento