This similarity formula you are using is not suitable for the distance calculation that the library polyglot
gives you. There is another formula of angle between vectors that is easier to apply and gives you a result between -1 and 1 and is called Cosine Similarity. Specifically in dealing with documents for natural language processing, as a rule the vector is always positive and the result of the application of the Cosine similarity is between 0 and 1.
Basically the concept is as follows: there are two vectors, the vector w and the vector v, as image below. The vector v makes a projection (a shadow) on the vector w. The size of this shadow relative to the vector w is what defines the similarity. When the value of the projection is 1, it means that the vector v is right on top of the vector w; both are parallel and identical. When the projection value is 0, it means that the vector v is at 90° of vector w; that is, they have no similarity. If the vector v was completely opposite to w, in the left direction, the value of the projection would be -1; that is, the opposite. As I said, we expect only values between 0 and 1 because of the positive nature of vectors.
Anyway, as you’re starting out, I’ll give you an alternative using Cosine similarity to arrive at a measure of word similarity. To get started, have the library spacy
in your environment. You can get it by:
pip install -U spacy
As I believe you will work with Portuguese, download also the language models package and the convolutional neural network trained:
python -m spacy download pt
python -m spacy download pt_core_news_sm
Now the code to use this library:
import spacy
# essas bibliotecas abaixo são só para plotar o resultado
import numpy as np
import pandas as pd
import seaborn as sns
# carregue o modelo
nlp = spacy.load('pt_core_news_sm')
# insira o input sempre em unicode
palavras = nlp(u'luz claridade amargo salgado')
dados=[]
for palavra1 in palavras:
for palavra2 in palavras:
dados.append(palavra1.similarity(palavra2)) # aqui eu testo a similaridade
# organização dos dados
dados = np.asarray(dados).reshape(len(palavras),len(palavras))
rotulo = [str(palavra) for palavra in palavras]
dados = pd.DataFrame(dados,rotulo,rotulo)
# plotagem
print(dados)
sns.heatmap(dados,annot=True,fmt=".2f",cmap="Blues_r",cbar=False,square=True,xticklabels='auto')
There’s a lot more you can do with this library. Look for the features on documentation and in websites that talk about her.
And as you say you’re starting out and it’s all new, in case you want to dig deeper into natural language processing (NLP), I recommend researching more about NLP (there are several good books about it) and libraries nltk
, stanfordcorenlp
, gensim
and textblob
.
Where did you get the method for this similarity formula? Most likely this way of calculating similarity is not compatible with what the Polyglot library provides. Give us more information so we can help you.
– Rafael Barros
Coming to work today, I share the code.
– Thiago de Paula
Edit the question because there is no formatting here.
– Rafael Barros
Rafael, the method is Embedding distances which is a Polyglot package. This method returns the distance of a word to a set of words that is within the vector. My question is: What is the maximum distance between a word and another?
– Thiago de Paula
from Polyglot.Mapping import Embedding embeddings = Embedding.load(os.environ['polyglot_data']+r' embeddings2 pt embeddings_pkl.tar.bz2') d = embeddings.distances('Green', ['Yellow', 'red', 'blue') )
– Thiago de Paula
Is this similarity calculation based on which distance method? Have you wondered if it is compatible with the Polyglot distance method? I went in the documentation of Polyglot and there is nothing regarding the calculation of similarity and not even what kind of method embedding.distances() uses to calculate the distance between two words. https://polyglot.readthedocs.io/en/latest/Embeddings.html#Nearest-Neighbors
– Rafael Barros
The documentation is here . https://polyglot.readthedocs.io/en/latest/Embeddings.html The method it uses is Euclidean Distance
– Thiago de Paula
And the method of similarity?
– Rafael Barros
It makes no sense for me to find the maximum Euclidean distance between two points, since the Euclidean distance can reach infinity. I even checked in Matering Natural Language Processing with Python and there is no method of similarity applied to the Euclidean distance between words. I advise you to set aside the Polyglot for this endeavor and search nltk.
– Rafael Barros
I understand Rafael. I’m starting and it’s all very new to me. Now tell me something. embbedding has a method (words) that returns all words in the corpus. I made a script that calculates the distance between all words and keeps the distance between all comparisons. The largest number returned by the script was something close to 2. Knowing that distances tend to infinity, embeddin provides a method that normalizes the distances of words. The method is normalize_words(). Looking at its definition, I saw that it takes an Ord parameter that the default is 2.
– Thiago de Paula
I just can’t answer if this parameter is the maximum distance a word can have from another. I’m pretty sure that’s right.
– Thiago de Paula
I’ll help you. Wait.
– Rafael Barros