How to get the name of the site?

Imagine a scenario where all I have are URLs like the following, stored in my database:

https://www.google.com
https://www.facebook.com
https://www.youtube.com
https://www.twitter.com

Given this case, and assuming every URL follows the form shown above, how could I come up with a way to get the name of the site?

For example, using a regex: when I call a particular method and pass it https://www.google.com, it should return only the string Google.

3 answers

1


function nome_dominio($url)
{
  $pieces = parse_url($url);
  $domain = isset($pieces['host']) ? $pieces['host'] : '';
  if (preg_match('/(?P<domain>[a-z0-9][a-z0-9\-]{1,63}\.[a-z\.]{2,6})$/i', $domain, $regs)) {
    $nome = explode('.',$regs['domain']);
    return ucfirst($nome[0]); // capitalize the first letter
  }
  return false;
}

// Examples (all return Google):
echo nome_dominio("https://mail.google.com"); // Returns Google
echo nome_dominio("https://google.com"); // Returns Google
  • There are some problems. For example, for the site http://www.saopaulo.sp.gov.br it returns sp instead of saopaulo, which is the correct name, because it treats saopaulo as a subdomain of sp. Solving this properly is far more complex; a simple regex cannot do it unless you list every case and put them all in the pattern.

  • Another example: http://meusite.floripa.br returns Floripa, even though floripa.br is a valid TLD, see here.

  • @Inkeliz I understand, but this code is not universal. It only handles sites in the pattern of the question, i.e., site.com or site.com.br.

1

Here is an example using a regex in JavaScript:

var urls = [
  'https://www.google.com',
  'https://www.facebook.com',
  'https://www.youtube.com',
  'https://www.twitter.com'
];

var $saida = document.getElementById("saida");

urls.forEach(function(url) {
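  // Group 2 captures the label right after "https://www." (the dot is escaped so it matches literally)
  var nome_site = /(https:\/\/www\.)([^.]+)(.*)/.exec(url)[2];
  nome_site = nome_site.charAt(0).toUpperCase() + nome_site.slice(1);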
  $saida.value = $saida.value + "\n" + url + ': ' + nome_site;
});

textarea {
  height: 200px;
  width: 100%;
}

<textarea id="saida"></textarea>

1

A solution that comes close to handling all cases is much more complex than a regex.

Unfortunately I could not make it any friendlier. The final code ended up a little messy, but I believe it is still understandable, and I will explain the whole process.


Advantages (compared to that answer):

  • It supports a wider range of domain types, such as floripa.br or adult.ht.

  • It supports public subdomains, for example <seusite>.blogspot.com and even <seusite>.s3.amazonaws.com and the like.


Requirements:

No extension, plugin, or framework is needed. Just download the public list of all domain suffixes/TLDs, available at https://publicsuffix.org/list/public_suffix_list.dat, and set the file's location on the line of the code that calls file().

This file should be updated periodically.
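One way to handle the periodic update is to check the age of the local copy before using it. This is only a sketch: the helper name listaDesatualizada and the 30-day default are assumptions for illustration, not part of the answer's code.

```php
<?php
// Hypothetical helper (not part of the answer's code): reports whether the
// local copy of the list is missing or older than $maxDias days.
// The 30-day default is an arbitrary assumption.
function listaDesatualizada(string $arquivo, int $maxDias = 30): bool
{
    if (!is_file($arquivo)) {
        return true; // no local copy yet
    }
    return (time() - filemtime($arquivo)) > $maxDias * 86400;
}

// Assumed usage, left commented out so the block runs without network access.
// Downloading with copy() requires allow_url_fopen to be enabled.
// if (listaDesatualizada('public_suffix_list.dat.txt')) {
//     copy('https://publicsuffix.org/list/public_suffix_list.dat', 'public_suffix_list.dat.txt');
// }
```

Running the check from a scheduled task rather than on every request would avoid repeated downloads.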

Code:

function pegaNome($url)
{

    $url = parse_url($url, PHP_URL_HOST);
    if (empty($url)) {
        return false;
    }

    $generico = ['com', 'org', 'net', 'edu', 'gov', 'mil'];

    $lista = array_filter(file('public_suffix_list.dat.txt')); // Download: https://publicsuffix.org/list/public_suffix_list.dat
    $lista = array_merge($lista, ['*']);

    $dominio = explode('.', $url);
    $dominioTamanho = count($dominio) - 1;

    $encontrado = [];

    foreach ($lista as $tld) {

        if (!in_array(substr($tld, 0, 1), ['!', '/', "\n"], true)) {

            $correto = 0;
            $partes = explode('.', $tld);
            $partesTamanho = count($partes);

            foreach ($partes as $i => $pedaco) {

                if (!isset($dominio[$dominioTamanho - $partesTamanho + $i + 1])) {
                    break;
                }

                $pedaco = trim($pedaco);
                // A '*' label stands for any of the generic names; compare before wrapping in an array.
                $pedaco = $pedaco === '*' ? $generico : [$pedaco];

                $correto += (int)(in_array($dominio[$dominioTamanho - $partesTamanho + $i + 1], $pedaco, true));

            }

            if ($correto === $partesTamanho) {
                $encontrado[] = $correto;
            }

        }

    }

    if (!empty($encontrado)) {
        rsort($encontrado);

        foreach($encontrado as $encontro){
            if(!empty($dominio[$dominioTamanho - $encontro])){
                return $dominio[$dominioTamanho - $encontro];
            }
        }

    }

    return $url;

}

Explanations:

File:

The file contains four kinds of lines (ignoring blank lines):

!tld
*.tld
// tld
tld

The code above ignores both the // tld lines, which are comments, and the !tld lines; I don't know the exact reason for the latter.

A *.tld line indicates that the suffix would be net.tld, com.tld, and so on, in most cases.
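The four kinds of lines can be told apart by their first characters. Below is a minimal sketch; classificaLinha is a hypothetical helper, not part of the code above. For reference, in the Public Suffix List format a leading ! marks an exception to a wildcard rule, which is why the sketch labels it that way.

```php
<?php
// Hypothetical helper (not part of the answer's code): labels one line of
// public_suffix_list.dat by the kind of rule it holds.
function classificaLinha(string $linha): string
{
    $linha = trim($linha);
    if ($linha === '') {
        return 'blank';
    }
    if (strncmp($linha, '//', 2) === 0) {
        return 'comment';
    }
    if ($linha[0] === '!') {
        return 'exception';
    }
    if (strncmp($linha, '*.', 2) === 0) {
        return 'wildcard';
    }
    return 'suffix';
}

// Sample lines chosen for illustration.
foreach (['!www.ck', '*.ck', '// a comment', 'com.br', ''] as $linha) {
    echo str_pad("'" . $linha . "'", 18), classificaLinha($linha), "\n";
}
```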

Checks:

When you ask it to check a URL, for example https://seusite.blogspot.com, exactly the following happens:

  • It uses PHP_URL_HOST to obtain seusite.blogspot.com.
  • It splits seusite.blogspot.com into seusite, blogspot and com.

Then we need to find the suffix used by your site:

  • Checks whether the last element equals ac: com != ac
  • Checks whether the trailing elements equal com.ac, that is:
    • Compares the penultimate element with com: blogspot != com
    • Compares the last element with ac: com != ac

This is repeated for every line of the file.

At some point it will do exactly this:

  • Checks whether the trailing elements equal blogspot.com:
    • Compares the penultimate element with blogspot: blogspot == blogspot
    • Compares the last element with com: com == com

Then it runs $encontrado[] = $correto, which stores the value 2, the number of parts the "subdomain" has (.blogspot.com = 2, .net = 1, .a.b.c = 3).

During the same loop it will also make this comparison:

  • Checks whether the last element equals com: com === com

This also stores the value 1 in $encontrado.

Result:

In the end we take the largest number in $encontrado and get the domain name based on it.

So if the largest value in $encontrado for seusite.blogspot.com is 2, we just take $dominio[count($dominio)-2-1].

So why build an array? Because someone might pass https://blogspot.com, which also matches both suffixes, but then count($dominio)-2-1 would be -1. In that case the code moves on to the next match found, here .com, and returns blogspot as expected.
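The idea of collecting every matching suffix and trying the longest one first can be shown in isolation. This is a minimal sketch under stated assumptions: nomePorSufixo is a hypothetical function, the small hard-coded $sufixos array stands in for public_suffix_list.dat, and wildcard and exception rules are ignored.

```php
<?php
// Hypothetical illustration (not the answer's code): $sufixos stands in for
// the real public_suffix_list.dat; wildcard and exception rules are ignored.
function nomePorSufixo(string $host, array $sufixos): string
{
    $rotulos = explode('.', strtolower($host));
    $tamanhos = [];

    foreach ($sufixos as $sufixo) {
        $partes = explode('.', $sufixo);
        $n = count($partes);
        // The suffix matches when the host's last $n labels equal its parts.
        if ($n <= count($rotulos) && array_slice($rotulos, -$n) === $partes) {
            $tamanhos[] = $n;
        }
    }

    rsort($tamanhos); // longest matching suffix first

    foreach ($tamanhos as $n) {
        $indice = count($rotulos) - $n - 1; // label just left of the suffix
        if ($indice >= 0) {
            return $rotulos[$indice];
        }
    }

    return $host; // no usable match
}

// Entries chosen for illustration only.
$sufixos = ['com', 'blogspot.com', 'br', 'gov.br', 'sp.gov.br'];

echo nomePorSufixo('seusite.blogspot.com', $sufixos), "\n";   // seusite
echo nomePorSufixo('blogspot.com', $sufixos), "\n";           // blogspot
echo nomePorSufixo('www.saopaulo.sp.gov.br', $sufixos), "\n"; // saopaulo
```

For blogspot.com the length-2 match leaves no label to its left, so the loop falls through to the length-1 match, which is the fallback described above.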
