How do I get compound domain names?

I saw How to get the site name? and, after testing, found that it works very well for simple names, but not for compound names. For example:

https://www.stackoverflow.com
https://www.oficinacarlos.com
https://www.lucasverduras.com

It returns everything run together, like this:

Stackoverflow

Oficinacarlos

Lucasverduras

Is there a way to take compound names like the ones above and return them like this:

Stack Overflow

Oficina Carlos

Lucas Verduras

I’m using the following code:

function nome_dominio($url)
{
    $pieces = parse_url($url);
    $domain = isset($pieces['host']) ? $pieces['host'] : '';
    if (preg_match('/(?P<domain>[a-z0-9][a-z0-9\-]{1,63}\.[a-z\.]{2,6})$/i', $domain, $regs)) {
        $nome = explode('.', $regs['domain']);
        return ucfirst($nome[0]); // capitalize the first letter
    }
    return false;
}

The function needs to handle both compound names and simple names.

  • 1

    I think this is difficult: what would be the criterion for turning Stackoverflow into Stack Overflow? Without one, you could just as well get Stac Koverflow, Stackover Flow and so on. It is not impossible, though: keep a database that maps domain names to site names and look up the site name by domain. If the set of URLs is limited, you can do it with an array. See https://ideone.com/HL1NkN

  • 2

    I think the easiest way is to simply fetch the page (e.g. with cURL or file_get_contents), read the <title>, and check which part of it matches the domain. So if the title is Stack Overflow em Portugues, the name is set to Stack Overflow. I think that is the only way.

  • 1

    @Inkeliz, I think an example of what you described, posted as an answer, would make an excellent answer to this question.

  • 2

    @Uzumakiartanis, I’m working on it. :D
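
For reference, the lookup-table idea from the first comment could look roughly like the sketch below. The table contents and the helper name nomePorTabela() are made up for illustration; the ideone snippet linked in the comment may differ.

```php
<?php
// Hypothetical lookup table: domain label => site name as it should be displayed.
$nomesConhecidos = [
    'stackoverflow' => 'Stack Overflow',
    'oficinacarlos' => 'Oficina Carlos',
    'lucasverduras' => 'Lucas Verduras',
];

// Returns the known display name, or the capitalized label when it is unknown.
function nomePorTabela(string $rotulo, array $tabela): string
{
    $rotulo = strtolower($rotulo);
    return $tabela[$rotulo] ?? ucfirst($rotulo);
}

echo nomePorTabela('stackoverflow', $nomesConhecidos), "\n"; // Stack Overflow
echo nomePorTabela('google', $nomesConhecidos), "\n";        // Google
```

This only works when the universe of domains is known in advance, which is exactly the limitation the comment points out.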

2 answers

2

It is difficult to create something that works in all cases. I tried to keep it as simple as possible, but in several cases it makes fairly gross errors.

Testing:

Results for Alexa’s top 30:

+----------------+-----------+
|     Domain     |   Name    |
+----------------+-----------+
| youtube.com    | YouTube   |
| facebook.com   | Facebook  |
| baidu.com      | Baidu     |
| wikipedia.org  | Wikipedia |
| yahoo.com      | Yahoo     |
| reddit.com     | reddit    |
| google.co.in   | Google    |
| qq.com         | Qq**      |
| amazon.com     | Amazon    |
| taobao.com     | Taobao    |
| google.co.jp   | Google    |
| twitter.com    | Twitter   |
| tmall.com      | Tmall**   |
| vk.com         | VK        |
| live.com       | Live      |
| instagram.com  | Instagram |
| sohu.com       | Sohu      |
| sina.com.cn    | Sina      |
| weibo.com      | Weibo**   |
| jd.com         | JD        |
| 360.cn         | 360       |
| google.de      | Google    |
| google.co.uk   | Google    |
| google.ru      | Google    |
| google.fr      | Google    |
| linkedin.com   | LinkedIn  |
| google.com     | Google    |
| list.tmall.com | Tmall**   |
| google.com.hk  | Google    |
| yandex.ru      | Yandex    |
+----------------+-----------+

Now for Alexa ranks 199992 through 200026:

+----------------------------+--------------------------------------------+
|           Domain           |                  Name                      |
+----------------------------+--------------------------------------------+
| gsm-specs.com              | GSM-specs.com - GSM-specs***               |
| cikm2017.org               | CIKM 2017                                  |
| sitkagear.com              | SITKA Gear | Turning Clothing Into Gear*** |
| laprocure.com              | La Procure                                 |
| pori.fi                    | Pori                                       |
| 1213wz.com                 | 1213wz                                     |
| unistar.by                 | Unistar                                    |
| upskirtjerk.com            | Upskirt Jerk                               |
| astarehsaghf.com           | Astarehsaghf*                              |
| dornc.com                  | Department of***                           |
| serviceacademyforums.com   | Service Academy Forums                     |
| yaledailynews.com          | Yale Daily News                            |
| rewardingexcellence.com    | rformance Ce***                            |
| lokosom.com.br             | Lokosom                                    |
| i-escape.com               | i-escape                                   |
| 90rss.com                  | 90rss                                      |
| bhdstar.vn                 | BHD STAR                                   |
| le-onze-parisien.fr        | Le Onze Parisien                           |
| criarweb.com               | CriarWeb                                   |
| fundayshop.com             | Fundayshop                                 |
| campsitephotos.com         | CampsitePhotos                             |
| spankwirefreehd.com        | Spankwirefreehd                            |
| kabudragon.com             | Kabudragon**                               |
| rebug.me                   | REBUG                                      |
| yuchaoyang.com             | Yuchaoyang*                                |
| naval.com.br               | NAVAL                                      |
| chesterfield.gov           | Chesterfield*                              |
| nururi.com                 | Nururi                                     |
| vcegdaprazdnik.ru          | Vcegdaprazdnik**                           |
| noridianmedicareportal.com | Noridianmedicareportal*                    |
| solobari.it                | Solobari                                   |
| kaddr.com                  | Kaddr                                      |
| mayoclinichealthsystem.org | Mayo Clinic Health System                  |
| sanayi.gov.tr              | Sanayi                                     |
+----------------------------+--------------------------------------------+

And for Alexa ranks 390000 through 390029:

+---------------------------+---------------------------------------------------------------------+
|          Domain           |                                Name                                 |
+---------------------------+---------------------------------------------------------------------+
| catholicplanet.com        | Catholic Planet                                                     |
| 4jovem.com                | 4jovem                                                              |
| uploadmb.com              | UploadMB                                                            |
| 2bet.ag                   | 2Bet                                                                |
| polnakorzina.ru           | Polnakorzina**                                                      |
| kktown.com.tw             | KKTOWN                                                              |
| pension.de                | Pensionen, Ferienunterkünfte & Ferienwohnungen finden - Pension*** |
| realresultslist.com       | realresultslist*                                                    |
| hoya.co.jp                | HOYA                                                                |
| fbw.jp                    | Fbw**                                                               |
| mongol-media.com          | Mongol-Media                                                        |
| indianpediatrics.net      | Indian Pediatrics                                                   |
| dmmfree.net               | DmmFree                                                             |
| mp3gui.info               | Mp3Gui                                                              |
| xhtmlforum.de             | XHTMLforum                                                          |
| whole9life.com            | Whole9 - Let us change your life***                                 |
| swidnica.pl               | Swidnica                                                            |
| revbrew.com               | rewery | Revolution Brew***                                         |
| nasleshahvar.ir           | Nasleshahvar                                                        |
| com-private.club          | Com-private                                                         |
| crack4patch.com           | Crack 4 Patch                                                       |
| incomingsoft.de           | Incomingsoft*                                                       |
| thefrustratedengineer.com | The Frustrated Engineer                                             |
| forumdesimages.fr         | Forum des images                                                    |
| tripvillas.com            | Tripvillas                                                          |
| araxis.com                | Araxis                                                              |
| rembetiko.gr              | Rembetiko                                                           |
| krasview.ru               | Krasview                                                            |
| duckokong.com             | Duckokong*                                                          |
| hotesextubes.com          | Hot Sex Tubes                                                       |
+---------------------------+---------------------------------------------------------------------+

Result for the links mentioned in the question:

+----------------------------+-----------------------------------------+
|           Domain           |                  Name                   |
+----------------------------+-----------------------------------------+
| stackoverflow.com          | Stack Overflow                          |
+----------------------------+-----------------------------------------+

Main problems:

  1. The website must be online and directly reachable by URL for this to work at all; redirects done by JavaScript, for example, are not followed. See the cases marked with *.

  2. "Asian"/"Russian" websites have major problems; see the cases marked with **.

  3. Because the method looks for a start position and an end position, the extracted stretch can be much longer or much shorter than the actual name; see the cases marked with ***. This could be improved by looking for the closest matching string, but I did nothing to fix it.


How does it work?

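// Extracts from $title the stretch that starts where the beginning of $name
// matches and ends where the end of $name matches.
function colidirTituloComNome($title, $name){

    $inicio = encontrarInicio($title, $name);
    $fim = encontrarFim($title, $name);

    if ($inicio !== false && $fim !== false){
        return mb_substr($title, $inicio, $fim - $inicio, 'UTF-8');
    }

    // No fragment of the name appears in the title: fall back to the name itself.
    return ucfirst($name);
}

// Returns the offset in $title of the first match of $name, retrying with the
// first half of $name until something matches or a single character is left.
function encontrarInicio($title, $name){

    $achado = mb_stripos($title, $name, 0, 'UTF-8');
    if ($achado !== false){
        return $achado;
    }

    if (mb_strlen($name, 'UTF-8') <= 1) {
        return false;
    }

    return encontrarInicio($title, mb_substr($name, 0, (int) ceil(mb_strlen($name, 'UTF-8') / 2), 'UTF-8'));
}

// Returns the offset in $title just past the last match of $name, retrying with
// the second half of $name until something matches or a single character is left.
function encontrarFim($title, $name){

    $achado = mb_strripos($title, $name, 0, 'UTF-8');
    if ($achado !== false){
        return $achado + mb_strlen($name, 'UTF-8');
    }

    if (mb_strlen($name, 'UTF-8') <= 1) {
        return false;
    }

    return encontrarFim($title, mb_substr($name, (int) ceil(mb_strlen($name, 'UTF-8') / 2), null, 'UTF-8'));
}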

The two search functions are "half" duplicated, but that’s it. The idea is that, given the input stackoverflow and the title Stack Overflow em Portugues, it keeps cutting the name down until a leading fragment (such as "Stack") and a trailing fragment (such as "flow") are found in the title, so you can extract "Stack Overflow".
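
To make the halving concrete, here is a small trace of which fragments get tried for this example. The helpers fragmentosTentados() and primeiroEncontrado() are hypothetical and exist only for this illustration; they mirror the recursion in encontrarInicio and encontrarFim but are not part of the code above.

```php
<?php
// Illustrative trace only. fragmentosTentados() is a hypothetical helper that
// lists the fragments the halving search tries, in order.
function fragmentosTentados(string $name, bool $doInicio): array
{
    $tentados = [$name];
    while (mb_strlen($name, 'UTF-8') > 1) {
        $meio = (int) ceil(mb_strlen($name, 'UTF-8') / 2);
        $name = $doInicio
            ? mb_substr($name, 0, $meio, 'UTF-8')       // keep the first half
            : mb_substr($name, $meio, null, 'UTF-8');   // keep the second half
        $tentados[] = $name;
    }
    return $tentados;
}

// Returns the first fragment that occurs in $title (case-insensitive), or false.
function primeiroEncontrado(string $title, array $fragmentos)
{
    foreach ($fragmentos as $f) {
        if (mb_stripos($title, $f, 0, 'UTF-8') !== false) {
            return $f;
        }
    }
    return false;
}

$title = 'Stack Overflow em Portugues';
echo implode(', ', fragmentosTentados('stackoverflow', true)), "\n";
// stackoverflow, stackov, stac, st, s
echo implode(', ', fragmentosTentados('stackoverflow', false)), "\n";
// stackoverflow, erflow, low, w
echo primeiroEncontrado($title, fragmentosTentados('stackoverflow', true)), "\n";  // stac
echo primeiroEncontrado($title, fragmentosTentados('stackoverflow', false)), "\n"; // erflow
```

So the start is where "stac" matches (offset 0) and the end is just past "erflow" (offset 14), which gives "Stack Overflow".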

There are several other ways to do this, some perhaps much more precise and efficient, for example using similar_text or levenshtein.
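
As a rough idea of what a levenshtein-based variant might look like (I have not run it against the lists above): try every run of consecutive title words, strip it down to letters and digits, and keep the run closest to the domain name. The function name melhorTrecho() and the "at most half the name length" cutoff are my own assumptions. Note that levenshtein works on bytes, so distances for non-ASCII titles are only approximate.

```php
<?php
// Sketch: pick the run of consecutive title words whose letters are closest
// (by Levenshtein distance) to the domain label. melhorTrecho() is a
// hypothetical name, and the distance cutoff is an arbitrary assumption.
function melhorTrecho(string $title, string $name): string
{
    $palavras = preg_split('/\s+/u', trim($title), -1, PREG_SPLIT_NO_EMPTY) ?: [];
    $total    = count($palavras);
    $melhor   = ucfirst($name);
    $menor    = PHP_INT_MAX;

    for ($i = 0; $i < $total; $i++) {
        for ($j = $i; $j < $total; $j++) {
            $trecho = implode(' ', array_slice($palavras, $i, $j - $i + 1));
            // Compare only letters and digits, lowercased, like a domain label.
            $chave = mb_strtolower((string) preg_replace('/[^\p{L}\p{N}]+/u', '', $trecho), 'UTF-8');
            if ($chave === '' || strlen($chave) > 255) {
                continue; // older PHP versions limit levenshtein() to 255 bytes
            }
            $dist = levenshtein($chave, $name);
            if ($dist < $menor) {
                $menor  = $dist;
                $melhor = $trecho;
            }
        }
    }

    // Reject matches that are too far from the name and fall back to the name.
    return $menor <= intdiv(strlen($name), 2) ? $melhor : ucfirst($name);
}

echo melhorTrecho('Stack Overflow em Portugues', 'stackoverflow'), "\n";         // Stack Overflow
echo melhorTrecho('SITKA Gear | Turning Clothing Into Gear', 'sitkagear'), "\n"; // SITKA Gear
```

Because it scores whole word runs, it should avoid cutting in the middle of a word, which is what produces some of the *** results.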

If nothing is found, it falls back to returning "Stackoverflow".


To obtain the value of <title>, you can use:

function pegaTitulo($url)
{
    $ch = curl_init($url);

    curl_setopt_array($ch, [
            CURLOPT_RETURNTRANSFER => 1,
            CURLOPT_USERAGENT => 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.0.0 Safari/537.36',
            CURLOPT_PROTOCOLS => CURLPROTO_HTTP | CURLPROTO_HTTPS,                             // Initial request: HTTP/HTTPS only
            CURLOPT_REDIR_PROTOCOLS => CURLPROTO_HTTP | CURLPROTO_HTTPS,                       // Redirects: HTTP/HTTPS only
            CURLOPT_SSLVERSION => CURL_SSLVERSION_TLSv1_2,
            CURLOPT_FOLLOWLOCATION => 1,
            CURLOPT_MAXREDIRS => 2,
            CURLOPT_SSL_VERIFYPEER => 1,
            CURLOPT_SSL_VERIFYHOST => 2,
            CURLOPT_TIMEOUT => 10,                                                             // Total timeout, in seconds
            CURLOPT_CONNECTTIMEOUT => 2,                                                       // Connection timeout, in seconds
            CURLOPT_FAILONERROR => 1,
            CURLOPT_CAINFO => __DIR__ . DIRECTORY_SEPARATOR . 'cacert-2017-06-07.pem',         // Download: https://curl.haxx.se/ca/cacert-2017-06-07.pem
        ]
    );

    if ($html = curl_exec($ch)) {

        // Real-world HTML is rarely valid; keep libxml warnings out of the output.
        libxml_use_internal_errors(true);
        $dom = new DOMDocument();

        if ($dom->loadHTML($html)) {
            $list = $dom->getElementsByTagName("title");
            if ($list->length > 0) {
                return $list->item(0)->textContent;
            }
        }
    }

    return false;
}

cURL fetches the page; it is limited to HTTP/HTTPS and follows at most 2 redirects. It also verifies the SSL certificate and has timeouts so that it fails if the server takes too long. This is minimally safe for public use, where the user is able to supply the $url.

If all goes well, it reads the contents of the <title> tag using DOMDocument.
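
The parsing step can be tried without any network access. The sketch below pulls out just that part into a hypothetical extrairTitulo() helper (not part of the code above) and feeds it a hard-coded HTML string. The trim() is my addition, and the empty-string guard is there because, as far as I know, loadHTML refuses empty input in PHP 8.

```php
<?php
// Hypothetical helper: the <title> extraction from pegaTitulo() on its own,
// so it can be tested against a literal HTML string.
function extrairTitulo(string $html)
{
    if ($html === '') {
        return false; // loadHTML() does not accept an empty string
    }

    libxml_use_internal_errors(true); // tolerate malformed markup silently
    $dom = new DOMDocument();

    if ($dom->loadHTML($html)) {
        $list = $dom->getElementsByTagName('title');
        if ($list->length > 0) {
            return trim($list->item(0)->textContent);
        }
    }

    return false;
}

$html = '<html><head><title> Stack Overflow em Portugues </title></head><body></body></html>';
var_dump(extrairTitulo($html)); // string(27) "Stack Overflow em Portugues"
var_dump(extrairTitulo('<html><head></head><body><p>no title</p></body></html>')); // bool(false)
```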

To get the name (https://answall.com for stackoverflow), you can use this other function.
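
That function is not reproduced here. As a stand-in, a pegaNome() along the following lines should be enough; this is my assumption of what it does, built on the regex from the question's nome_dominio() and returning the name in lowercase.

```php
<?php
// Hypothetical pegaNome(): an assumed stand-in for the linked function.
// Returns the registrable label of the host in lowercase, or false on failure.
function pegaNome(string $url)
{
    $host = parse_url($url, PHP_URL_HOST);
    if (!is_string($host) || $host === '') {
        return false;
    }

    // Same pattern as the question's nome_dominio(): one label followed by a
    // short suffix such as .com, .co.uk or .com.br at the end of the host.
    if (preg_match('/(?P<domain>[a-z0-9][a-z0-9\-]{1,63}\.[a-z\.]{2,6})$/i', $host, $regs)) {
        return strtolower(explode('.', $regs['domain'])[0]);
    }

    return false;
}

var_dump(pegaNome('https://www.stackoverflow.com')); // string(13) "stackoverflow"
var_dump(pegaNome('https://www.google.co.uk'));      // string(6) "google"
var_dump(pegaNome('not a url'));                     // bool(false)
```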

Then you can use:

$nome = pegaNome($url);
$titulo = pegaTitulo($url);

if ($nome && $titulo) {
    echo htmlentities(colidirTituloComNome($titulo, $nome));
}

  • I think answering just for the sake of answering does not help. Your answer is huge for something "simple" (if what the asker wants is possible at all). And even so, it leaves several open questions about whether it achieves what is intended. I don't find it valid to answer with a kludge (with all due respect) that only partially solves the problem. I think one should post an answer that satisfies the question, or else not answer at all.

1

This is not possible in an easy, native or automated way, because to create this type of algorithm you need to define patterns for the code to follow. And since these are proper names, the number of possible patterns is impractical to predict and analyze.

  • I agree. It is not possible to do what is intended without a reference.
