It is difficult to create something that works in all cases. I tried to keep it as simple as possible, but in several cases it produces somewhat grotesque errors.
Testing:
Results for Alexa's top 30:
+----------------+-----------+
| Domain | Name |
+----------------+-----------+
| youtube.com | YouTube |
| facebook.com | Facebook |
| baidu.com | Baidu |
| wikipedia.org | Wikipedia |
| yahoo.com | Yahoo |
| reddit.com | reddit |
| google.co.in | Google |
| qq.com | Qq** |
| amazon.com | Amazon |
| taobao.com | Taobao |
| google.co.jp | Google |
| twitter.com | Twitter |
| tmall.com | Tmall** |
| vk.com | VK |
| live.com | Live |
| instagram.com | Instagram |
| sohu.com | Sohu |
| sina.com.cn | Sina |
| weibo.com | Weibo** |
| jd.com | JD |
| 360.cn | 360 |
| google.de | Google |
| google.co.uk | Google |
| google.ru | Google |
| google.fr | Google |
| linkedin.com | LinkedIn |
| google.com | Google |
| list.tmall.com | Tmall** |
| google.com.hk | Google |
| yandex.ru | Yandex |
+----------------+-----------+
Now, for Alexa ranks 199992 through 200026:
+----------------------------+--------------------------------------------+
| Domain | Name |
+----------------------------+--------------------------------------------+
| gsm-specs.com | GSM-specs.com - GSM-specs*** |
| cikm2017.org | CIKM 2017 |
| sitkagear.com | SITKA Gear | Turning Clothing Into Gear*** |
| laprocure.com | La Procure |
| pori.fi | Pori |
| 1213wz.com | 1213wz |
| unistar.by | Unistar |
| upskirtjerk.com | Upskirt Jerk |
| astarehsaghf.com | Astarehsaghf* |
| dornc.com | Department of*** |
| serviceacademyforums.com | Service Academy Forums |
| yaledailynews.com | Yale Daily News |
| rewardingexcellence.com | rformance Ce*** |
| lokosom.com.br | Lokosom |
| i-escape.com | i-escape |
| 90rss.com | 90rss |
| bhdstar.vn | BHD STAR |
| le-onze-parisien.fr | Le Onze Parisien |
| criarweb.com | CriarWeb |
| fundayshop.com | Fundayshop |
| campsitephotos.com | CampsitePhotos |
| spankwirefreehd.com | Spankwirefreehd |
| kabudragon.com | Kabudragon** |
| rebug.me | REBUG |
| yuchaoyang.com | Yuchaoyang* |
| naval.com.br | NAVAL |
| chesterfield.gov | Chesterfield* |
| nururi.com | Nururi |
| vcegdaprazdnik.ru | Vcegdaprazdnik** |
| noridianmedicareportal.com | Noridianmedicareportal* |
| solobari.it | Solobari |
| kaddr.com | Kaddr |
| mayoclinichealthsystem.org | Mayo Clinic Health System |
| sanayi.gov.tr | Sanayi |
+----------------------------+--------------------------------------------+
And for Alexa ranks 390000 through 390029:
+---------------------------+---------------------------------------------------------------------+
| Domain | Name |
+---------------------------+---------------------------------------------------------------------+
| catholicplanet.com | Catholic Planet |
| 4jovem.com | 4jovem |
| uploadmb.com | UploadMB |
| 2bet.ag | 2Bet |
| polnakorzina.ru | Polnakorzina** |
| kktown.com.tw | KKTOWN |
| pension.de | Pensionen, Ferienunterkünfte & Ferienwohnungen finden - Pension*** |
| realresultslist.com | realresultslist* |
| hoya.co.jp | HOYA |
| fbw.jp | Fbw** |
| mongol-media.com | Mongol-Media |
| indianpediatrics.net | Indian Pediatrics |
| dmmfree.net | DmmFree |
| mp3gui.info | Mp3Gui |
| xhtmlforum.de | XHTMLforum |
| whole9life.com | Whole9 - Let us change your life*** |
| swidnica.pl | Swidnica |
| revbrew.com | rewery | Revolution Brew*** |
| nasleshahvar.ir | Nasleshahvar |
| com-private.club | Com-private |
| crack4patch.com | Crack 4 Patch |
| incomingsoft.de | Incomingsoft* |
| thefrustratedengineer.com | The Frustrated Engineer |
| forumdesimages.fr | Forum des images |
| tripvillas.com | Tripvillas |
| araxis.com | Araxis |
| rembetiko.gr | Rembetiko |
| krasview.ru | Krasview |
| duckokong.com | Duckokong* |
| hotesextubes.com | Hot Sex Tubes |
+---------------------------+---------------------------------------------------------------------+
Result for the link mentioned:
+----------------------------+-----------------------------------------+
| Domain | Name |
+----------------------------+-----------------------------------------+
| stackoverflow.com | Stack Overflow |
+----------------------------+-----------------------------------------+
Main problems:
The website must be available, work at least minimally, and be reachable at the given URL, with no redirects done by JavaScript, for example. See the cases marked with *.
"Asian"/"Russian" websites have major problems; see the cases marked with **.
Because of how the method works, the start and end positions it finds can cover a stretch much longer or much shorter than the name itself. See the cases marked with ***. This could be improved by looking for the closest matching strings, but I did nothing to fix it.
How does it work?
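// Cuts out of $title the stretch that runs from where the start of $name
// is found to where the end of $name is found.
function colidirTituloComNome($title, $name){
    $inicio = encontrarInicio($title, $name);
    $fim = encontrarFim($title, $name);

    if ($inicio !== false && $fim !== false){
        return mb_substr($title, $inicio, $fim - $inicio, 'UTF-8');
    }

    // Nothing matched: fall back to the capitalized domain name.
    return ucfirst($name);
}

// Position in $title where $name, or a progressively shorter prefix of it, first appears.
function encontrarInicio($title, $name){
    $achado = mb_stripos($title, $name, 0, 'UTF-8');
    if ($achado !== false){
        return $achado;
    }
    if (mb_strlen($name, 'UTF-8') <= 1) {
        return false;
    }
    // Not found: retry with the first half of the name.
    return encontrarInicio($title, mb_substr($name, 0, (int) ceil(mb_strlen($name, 'UTF-8')/2), 'UTF-8'));
}

// Position in $title just after the last appearance of $name, or of a progressively shorter suffix of it.
function encontrarFim($title, $name){
    $achado = mb_strripos($title, $name, 0, 'UTF-8');
    if ($achado !== false){
        return $achado + mb_strlen($name, 'UTF-8');
    }
    if (mb_strlen($name, 'UTF-8') <= 1) {
        return false;
    }
    // Not found: retry with the second half of the name.
    return encontrarFim($title, mb_substr($name, (int) ceil(mb_strlen($name, 'UTF-8')/2), null, 'UTF-8'));
}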
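To make the halving concrete, here is a small sketch that lists the fragments the two search functions above would try, in order, if none of them matched. The helper names prefixosTentados and sufixosTentados are made up for this illustration; the real functions stop at the first fragment found in the title.

```php
<?php
// Hypothetical helper: prefixes that encontrarInicio would try, longest first.
function prefixosTentados(string $name): array
{
    $tentativas = [$name];
    while (mb_strlen($name, 'UTF-8') > 1) {
        // Same step as encontrarInicio: keep the first half (rounded up).
        $name = mb_substr($name, 0, (int) ceil(mb_strlen($name, 'UTF-8') / 2), 'UTF-8');
        $tentativas[] = $name;
    }
    return $tentativas;
}

// Hypothetical helper: suffixes that encontrarFim would try, longest first.
function sufixosTentados(string $name): array
{
    $tentativas = [$name];
    while (mb_strlen($name, 'UTF-8') > 1) {
        // Same step as encontrarFim: keep what comes after the first half.
        $name = mb_substr($name, (int) ceil(mb_strlen($name, 'UTF-8') / 2), null, 'UTF-8');
        $tentativas[] = $name;
    }
    return $tentativas;
}

echo implode(', ', prefixosTentados('stackoverflow')), PHP_EOL;
// stackoverflow, stackov, stac, st, s
echo implode(', ', sufixosTentados('stackoverflow')), PHP_EOL;
// stackoverflow, erflow, low, w
```

For the title "Stack Overflow em Portugues", the first prefix that matches is "stac" and the first suffix that matches is "erflow", which is why the cut yields "Stack Overflow".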
The two search functions are "half" duplicated, but that's how it is. The idea is that, given the name stackoverflow and the title Stack Overflow em Portugues, it cuts the title from the point where the start of the name is found (here "stac", at the beginning of "Stack") to the point where the end of the name is found (here "erflow", at the end of "Overflow"), so you get "Stack Overflow".
If nothing is found, it returns "Stackoverflow".
There are several other ways to do this, some perhaps much more precise and efficient, for example using similar_text or levenshtein.
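As a rough idea of what a levenshtein-based variant could look like, the sketch below picks the run of consecutive title words whose letters are closest to the domain name. It is not the method used above. The function name nomeMaisProximo and the acceptance threshold (at most half of the name may differ) are assumptions made for this example, and levenshtein compares bytes, so it is only approximate for non-ASCII titles.

```php
<?php
// Hypothetical alternative, not the method used in this answer: choose the run
// of consecutive title words whose letters are closest to the domain name.
function nomeMaisProximo(string $title, string $name): string
{
    $alvo = mb_strtolower($name, 'UTF-8');
    $words = preg_split('/\s+/u', trim($title), -1, PREG_SPLIT_NO_EMPTY);
    if ($words === false) {
        $words = []; // preg_split fails on invalid UTF-8
    }
    $count = count($words);

    // Assumed threshold: a candidate must beat "half of the name differs".
    $melhorDistancia = intdiv(strlen($alvo), 2) + 1;
    $melhor = ucfirst($name); // same fallback as colidirTituloComNome

    for ($i = 0; $i < $count; $i++) {
        for ($j = $i; $j < $count; $j++) {
            $candidato = implode(' ', array_slice($words, $i, $j - $i + 1));
            // Compare ignoring spaces and letter case.
            $normalizado = mb_strtolower(str_replace(' ', '', $candidato), 'UTF-8');
            if (strlen($normalizado) > 255) {
                break; // older PHP versions reject longer levenshtein() arguments
            }
            $distancia = levenshtein($normalizado, $alvo);
            if ($distancia < $melhorDistancia) {
                $melhorDistancia = $distancia;
                $melhor = $candidato;
            }
        }
    }
    return $melhor;
}

echo nomeMaisProximo('Stack Overflow em Portugues', 'stackoverflow'), PHP_EOL;
```

Because whole words are compared, a title such as "SITKA Gear | Turning Clothing Into Gear" would give "SITKA Gear" for the name sitkagear. The cost is a quadratic number of candidates per title, which should be acceptable for titles of ordinary length.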
To obtain the value of <title>, you can use:
function pegaTitulo($url)
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => 1,
        CURLOPT_USERAGENT => 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.0.0 Safari/537.36',
        // Restrict the initial request, and not only the redirects, to HTTP/HTTPS.
        CURLOPT_PROTOCOLS => CURLPROTO_HTTP | CURLPROTO_HTTPS,
        CURLOPT_REDIR_PROTOCOLS => CURLPROTO_HTTP | CURLPROTO_HTTPS,
        CURLOPT_SSLVERSION => CURL_SSLVERSION_TLSv1_2,
        CURLOPT_FOLLOWLOCATION => 1,
        CURLOPT_MAXREDIRS => 2,
        CURLOPT_SSL_VERIFYPEER => 1,
        CURLOPT_SSL_VERIFYHOST => 2,
        CURLOPT_TIMEOUT => 10, // Timeout for the whole request, in seconds
        CURLOPT_CONNECTTIMEOUT => 2, // Timeout for the connection phase, in seconds
        CURLOPT_FAILONERROR => 1,
        CURLOPT_CAINFO => __DIR__ . DIRECTORY_SEPARATOR . 'cacert-2017-06-07.pem', // Download: https://curl.haxx.se/ca/cacert-2017-06-07.pem
    ]);

    if ($html = curl_exec($ch)) {
        // Real-world HTML is rarely valid; keep libxml warnings out of the output.
        libxml_use_internal_errors(true);
        $dom = new DOMDocument();
        if ($dom->loadHTML($html)) {
            $list = $dom->getElementsByTagName("title");
            if ($list->length > 0) {
                return $list->item(0)->textContent;
            }
        }
    }

    return false;
}
cURL fetches the page. It follows at most 2 redirects, and only over HTTP/HTTPS. It also verifies the SSL certificate and uses timeouts so that the request fails if it takes too long. This is minimally safe for public use, where the user is able to define the $url.
If all goes well, it reads the contents of the <title> tag using DOMDocument.
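The DOMDocument step can be tried without any network access. The sketch below applies the same parsing to an HTML string and additionally collapses whitespace, since titles often span several lines. The function name extrairTitulo and the whitespace handling are assumptions added for this example, not part of pegaTitulo.

```php
<?php
// Hypothetical helper: the DOMDocument step of pegaTitulo, applied to a string.
function extrairTitulo(string $html)
{
    if (trim($html) === '') {
        return false; // nothing to parse; loadHTML does not accept an empty string
    }

    // Collect libxml warnings internally instead of printing them.
    $anterior = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $ok = $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($anterior);

    if (!$ok) {
        return false;
    }
    $list = $dom->getElementsByTagName('title');
    if ($list->length === 0) {
        return false;
    }

    // Collapse line breaks and repeated spaces, then trim both ends.
    $titulo = trim(preg_replace('/\s+/u', ' ', $list->item(0)->textContent));
    return $titulo === '' ? false : $titulo;
}

$html = "<html><head><title>\n  Stack Overflow\n  em Portugues </title></head><body></body></html>";
var_dump(extrairTitulo($html));
```

Note that loadHTML does not assume UTF-8 when the page declares no charset, so titles with non-ASCII characters may need extra handling that this sketch does not attempt.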
To get the name (stackoverflow, for https://answall.com) you can use this other function.
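The function referred to is not shown in this answer, so the sketch below is only a stand-in under simple assumptions. The name pegaNomeSimples, the heuristic (drop the TLD, then drop a generic second-level label such as "co" or "com") and the list of such labels are all made up for this example; a robust version would rely on the Public Suffix List instead.

```php
<?php
// Hypothetical stand-in for pegaNome; the heuristic and the label list are assumptions.
function pegaNomeSimples(string $url)
{
    // parse_url() only reports a host when the URL has a scheme.
    if (!preg_match('#^[a-z][a-z0-9+.-]*://#i', $url)) {
        $url = 'http://' . $url;
    }
    $host = parse_url($url, PHP_URL_HOST);
    if (!is_string($host) || $host === '') {
        return false;
    }

    $labels = explode('.', strtolower(trim($host, '.')));
    if (count($labels) < 2) {
        return false; // e.g. "localhost": there is no TLD to strip
    }
    array_pop($labels); // drop the TLD ("com", "in", "cn", ...)

    // Assumed list of generic second-level labels ("co.in", "com.cn", "gov.tr", ...).
    $segundoNivel = ['co', 'com', 'gov', 'org', 'net', 'edu', 'ac'];
    if (count($labels) > 1 && in_array(end($labels), $segundoNivel, true)) {
        array_pop($labels);
    }

    // What remains ends with the registered name; subdomains such as "www" come before it.
    return end($labels);
}

var_dump(pegaNomeSimples('https://stackoverflow.com'));
var_dump(pegaNomeSimples('http://list.tmall.com'));
```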
Then you can use:
$nome = pegaNome($url);
$titulo = pegaTitulo($url);
if ($nome && $titulo) {
    echo htmlentities(colidirTituloComNome($titulo, $nome));
}
I find it difficult, because what would the criterion be for, say, Stackoverflow to return Stack Overflow? On its own, that could just as well produce Stac Koverflow, Stackover Flow and so on. It is not impossible, though: just keep a database with all the domains and site names, and return the site name by looking up the domain. If the universe of URLs is limited, you can do it with an array. See https://ideone.com/HL1NkN
– user60252
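A minimal sketch of the array idea from the comment above, assuming a small, fixed set of domains. The constant NOMES_CONHECIDOS and the function nomePorDominio are hypothetical names, and the entries are copied from the result tables in the answer; this is not the code behind the ideone link.

```php
<?php
// Hypothetical lookup table; entries copied from the result tables in the answer.
const NOMES_CONHECIDOS = [
    'stackoverflow.com' => 'Stack Overflow',
    'youtube.com'       => 'YouTube',
    'linkedin.com'      => 'LinkedIn',
];

// Returns the known site name for a domain, or false when it is not in the table.
function nomePorDominio(string $dominio)
{
    $dominio = strtolower($dominio);
    // Treat "www.example.com" and "example.com" as the same entry.
    if (strpos($dominio, 'www.') === 0) {
        $dominio = substr($dominio, 4);
    }
    return NOMES_CONHECIDOS[$dominio] ?? false;
}

var_dump(nomePorDominio('www.stackoverflow.com'));
```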
I think the easiest way is to simply access the page (e.g. with cURL or file_get_contents), get the information from <title>, and check which part of it comes closest to the domain. So if the site name is Stack Overflow em Portugues, it is set to Stack Overflow. I think that is the only way.
– Inkeliz
@Inkeliz, I think an example of what you said, posted as an answer, would make an excellent answer to this question.
– UzumakiArtanis
@UzumakiArtanis, I'm working on it. :D
– Inkeliz