A solution that covers all cases is much more complex than a regex.
Unfortunately I could not make it any friendlier, and the final code ended up a little confusing, but I believe it can still be understood, and I will explain the whole process.
Advantages (compared to the other answer):
- It supports all kinds of domains, such as floripa.br or adult.ht.
- It supports public subdomains, for example <seusite>.blogspot.com and even <seusite>.s3.amazonaws.com and the like.
Requirements:
No extension, plugin, or framework is needed. Just download the public list of all domains/TLDs, available at https://publicsuffix.org/list/public_suffix_list.dat, and set the location of the file on the line of the code marked with the Download comment.
This file should be updated periodically.
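One way to keep the file up to date is to download it again only when the local copy is old. This is a minimal sketch, not part of the original code: the helper names listaDesatualizada and atualizaLista and the one-week default are assumptions, and reading a URL with file_get_contents() requires allow_url_fopen to be enabled.

```php
<?php
// Hypothetical helper: true when the local copy is missing or older than $maxIdade seconds.
function listaDesatualizada(string $caminho, int $maxIdade = 604800): bool
{
    if (!is_file($caminho)) {
        return true;
    }
    clearstatcache(true, $caminho); // avoid a cached modification time
    return (time() - filemtime($caminho)) > $maxIdade;
}

// Hypothetical helper: downloads the list again only when the local copy is stale.
function atualizaLista(string $caminho, int $maxIdade = 604800): bool
{
    if (!listaDesatualizada($caminho, $maxIdade)) {
        return true; // the local copy is still fresh
    }
    $dados = @file_get_contents('https://publicsuffix.org/list/public_suffix_list.dat');
    if ($dados === false || $dados === '') {
        return false; // keep the old file if the download fails
    }
    return file_put_contents($caminho, $dados, LOCK_EX) !== false;
}
```

Calling atualizaLista('public_suffix_list.dat.txt') before pegaNome() would then refresh the file at most once a week under these assumptions.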
Code:
function pegaNome($url)
{
    $url = parse_url($url, PHP_URL_HOST);
    if (empty($url)) {
        return false;
    }

    // Labels that "*" stands for in wildcard rules such as "*.tld".
    $generico = ['com', 'org', 'net', 'edu', 'gov', 'mil'];

    // Download: https://publicsuffix.org/list/public_suffix_list.dat
    $lista = array_filter(file('public_suffix_list.dat.txt'));
    $lista = array_merge($lista, ['*']);

    $dominio = explode('.', $url);
    $dominioTamanho = count($dominio) - 1;
    $encontrado = [];

    foreach ($lista as $tld) {
        // Skip exceptions ("!tld"), comments ("// tld") and blank lines.
        if (!in_array(substr($tld, 0, 1), ['!', '/', "\n"], true)) {
            $correto = 0;
            $partes = explode('.', $tld);
            $partesTamanho = count($partes);

            // Compare each part of the rule with the matching part at the end of the host.
            foreach ($partes as $i => $pedaco) {
                if (!isset($dominio[$dominioTamanho - $partesTamanho + $i + 1])) {
                    break;
                }
                // A "*" part accepts any generic label; any other part must match exactly.
                $pedaco = trim($pedaco);
                $pedaco = $pedaco === '*' ? $generico : [$pedaco];
                $correto += (int)(in_array($dominio[$dominioTamanho - $partesTamanho + $i + 1], $pedaco, true));
            }

            // The rule matches only if every one of its parts matched.
            if ($correto === $partesTamanho) {
                $encontrado[] = $correto;
            }
        }
    }

    if ($encontrado !== []) {
        // Try the longest match first and fall back to shorter ones.
        rsort($encontrado);
        foreach ($encontrado as $encontro) {
            if (!empty($dominio[$dominioTamanho - $encontro])) {
                return $dominio[$dominioTamanho - $encontro];
            }
        }
    }

    return $url;
}
Explanations:
File:
The file has four kinds of lines (ignoring blank lines):
- !tld
- *.tld
- // tld
- tld
The code above ignores both // tld, which are comments, and !tld, whose exact purpose I don't know.
A *.tld line indicates that the suffix would be net.tld or com.tld, for example, in most cases.
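To make the four kinds of lines concrete, here is a small sketch that classifies a single line of the file. The function name tipoDaLinha and the returned labels are hypothetical, and the sample lines are only illustrative.

```php
<?php
// Hypothetical helper: classifies one line of public_suffix_list.dat.
function tipoDaLinha(string $linha): string
{
    $linha = trim($linha);
    if ($linha === '') {
        return 'blank';
    }
    if (strncmp($linha, '//', 2) === 0) {
        return 'comment';   // "// tld"
    }
    if ($linha[0] === '!') {
        return 'exception'; // "!tld"
    }
    if (strncmp($linha, '*.', 2) === 0) {
        return 'wildcard';  // "*.tld"
    }
    return 'normal';        // "tld"
}

// Illustrative sample lines, one of each kind.
foreach (["// a comment\n", "ac\n", "com.ac\n", "*.ck\n", "!www.ck\n", "\n"] as $linha) {
    printf("%-14s => %s\n", trim($linha), tipoDaLinha($linha));
}
```

pegaNome() makes the same distinction by looking only at the first character of each line.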
Checks:
When you ask it to check a URL, for example https://seusite.blogspot.com, exactly the following is done:
- Uses parse_url() with PHP_URL_HOST to obtain seusite.blogspot.com.
- Splits seusite.blogspot.com into seusite, blogspot and com.
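These two steps can be tried in isolation. The variable names below are only for illustration; note that parse_url() finds no host when the scheme is missing, which is why pegaNome() returns false in that case.

```php
<?php
// Step 1: extract the host from the URL.
$host = parse_url('https://seusite.blogspot.com', PHP_URL_HOST);

// Step 2: split the host into its parts.
$partes = explode('.', $host);

var_dump($host);   // seusite.blogspot.com
print_r($partes);  // seusite, blogspot, com

// Without a scheme the whole string is treated as a path, so the host is null.
$semEsquema = parse_url('seusite.blogspot.com', PHP_URL_HOST);
var_dump($semEsquema); // NULL
```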
Then we need to check the domain used by your website against each line of the file:
- Checks whether the last element is equal to ac: com != ac
- Checks whether the last two elements are equal to com.ac, that is:
  - Compares the penultimate element with com: blogspot != com
  - Compares the last element with ac: com != ac
This is repeated for each line of the file.
At some point it will do exactly this:
- Checks whether the last two elements are equal to blogspot.com:
  - Compares the penultimate element with blogspot: blogspot == blogspot
  - Compares the last element with com: com == com
It then runs $encontrado[] = $correto, which stores the value 2, the number of parts that the "subdomain" has (.blogspot.com = 2, .net = 1, .a.b.c = 3).
Likewise, in a later comparison it will do:
- Checks whether the last element is equal to com: com === com
This also stores a value, 1, in $encontrado.
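The comparisons above can be reproduced with a simplified sketch that handles only plain rules (no *, ! or comments). The function name partesQueCasam is hypothetical, and the four rules are just the ones used in this walkthrough.

```php
<?php
// Hypothetical helper: returns how many parts of $regra match the end of $host,
// or 0 when the rule does not match.
function partesQueCasam(string $host, string $regra): int
{
    $dominio = explode('.', $host);
    $partes = explode('.', trim($regra));
    $inicio = count($dominio) - count($partes);
    if ($inicio < 0) {
        return 0; // the rule has more parts than the host
    }
    foreach ($partes as $i => $pedaco) {
        if ($dominio[$inicio + $i] !== $pedaco) {
            return 0; // one differing part is enough to reject the rule
        }
    }
    return count($partes);
}

$encontrado = [];
foreach (['ac', 'com.ac', 'blogspot.com', 'com'] as $regra) {
    $n = partesQueCasam('seusite.blogspot.com', $regra);
    if ($n > 0) {
        $encontrado[] = $n;
    }
}
print_r($encontrado); // holds 2 (blogspot.com) and 1 (com)
```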
Result:
In the end we take the largest number in $encontrado and then get the domain name based on it.
So if seusite.blogspot.com has 2 as its largest $encontrado, we just do $dominio[count($dominio)-2-1].
So why create an array? Because someone might pass https://blogspot.com. Both rules would match in that case too, but count($dominio)-2-1 would then be -1. So the code moves on to the next match found, in this case .com, and returns blogspot as usual.
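The fallback can be seen in a small sketch of this last step on its own. The function name escolheNome is hypothetical, and unlike pegaNome() it returns null instead of the host when nothing fits.

```php
<?php
// Hypothetical helper mirroring the last step of pegaNome().
function escolheNome(array $dominio, array $encontrado)
{
    $ultimo = count($dominio) - 1;
    rsort($encontrado); // longest match first
    foreach ($encontrado as $encontro) {
        $indice = $ultimo - $encontro;
        // A negative index means the match covers the whole host, so try the next one.
        if ($indice >= 0 && $dominio[$indice] !== '') {
            return $dominio[$indice];
        }
    }
    return null;
}

var_dump(escolheNome(['seusite', 'blogspot', 'com'], [1, 2])); // seusite
var_dump(escolheNome(['blogspot', 'com'], [1, 2]));            // blogspot
```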
There are some problems. For example, for the site http://www.saopaulo.sp.gov.br it returns sp instead of saopaulo, which is the correct name. It treats saopaulo as a subdomain of sp. The solution to this is far more complex; it cannot be a simple regex, unless you list all the cases and put them in the regex. – Inkeliz
Another example: http://meusite.floripa.br returns floripa. It is a valid TLD. – Inkeliz
@Inkeliz I understand, but this code is not universal; it only serves sites in the pattern of the question, i.e. site.com or site.com.br. – Sam