How to calculate the similarity between two texts using word embedding in Python?


I’m using polyglot’s word embeddings to calculate the similarity between two texts, but I’m stuck on the final similarity calculation.

The method embedding.distances(word, [words]) returns the distance from a word to each word in the array. The problem is that I don’t know what scale those distances are on, and that scale is essential for computing the similarity index.

Similarity = (sum of the minimum distance of each word) / (number of elements * maximum distance value of each word)

I don’t know how to find the maximum distance value of each word.
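To make the formula concrete, this is roughly what I’m trying to compute. The texts texto1/texto2 are placeholders, the only Polyglot call used is the embeddings.distances(word, [words]) mentioned above, and distancia_maxima is exactly the value I don’t know how to obtain:

from polyglot.mapping import Embedding

# embeddings = Embedding.load(...)  # loaded as shown in the comments below

def similaridade(texto1, texto2, embeddings, distancia_maxima):
    palavras1 = texto1.split()
    palavras2 = texto2.split()
    # sum of the minimum distance of each word of texto1 to the words of texto2
    soma_minimos = sum(min(embeddings.distances(p, palavras2)) for p in palavras1)
    # normalized by (number of elements * maximum distance value of each word)
    return soma_minimos / (len(palavras1) * distancia_maxima)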

  • Where did you get this similarity formula? Most likely this way of calculating similarity is not compatible with what the Polyglot library provides. Give us more information so we can help you.

  • When I get to work today, I’ll share the code.

  • Edit the question because there is no formatting here.

  • Rafael, the method is embedding.distances, from the Polyglot package. It returns the distance from a word to a set of words in the array. My question is: what is the maximum possible distance between one word and another?

  • import os; from polyglot.mapping import Embedding; embeddings = Embedding.load(os.environ['polyglot_data'] + r'/embeddings2/pt/embeddings_pkl.tar.bz2'); d = embeddings.distances('Green', ['Yellow', 'red', 'blue'])

  • Which distance metric is this similarity calculation based on? Have you checked whether it is compatible with Polyglot’s distance method? I looked in the Polyglot documentation and there is nothing about a similarity calculation, nor about which metric embedding.distances() uses to compute the distance between two words. https://polyglot.readthedocs.io/en/latest/Embeddings.html#Nearest-Neighbors

  • The documentation is here: https://polyglot.readthedocs.io/en/latest/Embeddings.html The method it uses is Euclidean distance.

  • And the similarity method?

  • It makes no sense to me to look for a maximum Euclidean distance between two points, since the Euclidean distance can grow without bound. I even checked Mastering Natural Language Processing with Python and there is no similarity method based on the Euclidean distance between words. I advise you to set Polyglot aside for this task and look into nltk.

  • I understand, Rafael. I’m just starting and this is all very new to me. Now tell me something: embedding has a method, words, that returns all the words in the corpus. I wrote a script that calculates the distance between every pair of words and keeps all the results. The largest number returned by the script was close to 2. Knowing that Euclidean distances can grow without bound, embedding provides a method that normalizes the word vectors, normalize_words(). Looking at its definition, I saw that it takes an ord parameter whose default is 2.

  • I just can’t tell whether that parameter is the maximum distance one word can have from another, but I’m pretty sure it is (see the sketch below these comments).

  • I’ll help you. Wait.
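Sketch referenced in the comments above: if normalize_words(ord=2) really rescales every word vector to unit L2 norm, which the parameter name suggests but the Polyglot docs do not spell out, then the Euclidean distance between any two normalized vectors is at most 2, matching the ~2 maximum the script found. A numpy-only check:

import numpy as np

# two arbitrary vectors, hypothetical stand-ins for word embeddings
v = np.random.randn(64)
w = np.random.randn(64)

# L2-normalize them, which is what normalize_words(ord=2) is assumed to do
v_unit = v / np.linalg.norm(v, ord=2)
w_unit = w / np.linalg.norm(w, ord=2)

# for unit vectors, ||v - w||^2 = 2 - 2*cos(theta), so the distance never exceeds 2
print(np.linalg.norm(v_unit - w_unit))       # always <= 2
print(np.linalg.norm(v_unit - (-v_unit)))    # exactly 2: opposite directions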


1 answer


The similarity formula you are using is not suitable for the distance calculation that the polyglot library gives you. There is another formula, based on the angle between vectors, that is easier to apply and gives a result between -1 and 1: cosine similarity. Specifically when dealing with documents in natural language processing, the vectors are as a rule always positive, so the result of applying cosine similarity falls between 0 and 1.

Basically the concept is as follows: there are two vectors, w and v, as in the image below. The vector v makes a projection (a shadow) onto the vector w. The size of this projection relative to the vector w is what defines the similarity. When the projection value is 1, the vector v lies right on top of the vector w; they are parallel and identical. When the projection value is 0, the vector v is at 90° to w, so they have no similarity. If v pointed in exactly the opposite direction of w, the projection value would be -1, i.e., the exact opposite. As I said, we expect only values between 0 and 1 because of the positive nature of the vectors.

[Image: cosine similarity as the projection of v onto w]
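In code, the definition is just the dot product of the two vectors divided by the product of their norms. A minimal sketch using plain numpy, independent of Polyglot or spacy:

import numpy as np

def cosine_similarity(v, w):
    # cos(theta) = (v . w) / (||v|| * ||w||)
    return np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w))

v = np.array([1.0, 2.0, 3.0])
w = np.array([2.0, 4.0, 6.0])
print(cosine_similarity(v, w))   #  1.0 -> same direction, maximum similarity
print(cosine_similarity(v, -w))  # -1.0 -> opposite direction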

Anyway, as you’re starting out, I’ll give you an alternative that uses cosine similarity to arrive at a measure of word similarity. To get started, install the spacy library in your environment:

pip install -U spacy

As I believe you will be working with Portuguese, also download the language package and the trained convolutional neural network model:

python -m spacy download pt
python -m spacy download pt_core_news_sm

Now the code to use this library:

import spacy
# the libraries below are only used to display the result
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# load the Portuguese model
nlp = spacy.load('pt_core_news_sm')

# always pass the input as unicode
palavras = nlp(u'luz claridade amargo salgado')

dados = []
for palavra1 in palavras:
    for palavra2 in palavras:
        dados.append(palavra1.similarity(palavra2))    # the similarity test happens here

# arrange the data as a words-by-words matrix
dados = np.asarray(dados).reshape(len(palavras), len(palavras))
rotulo = [str(palavra) for palavra in palavras]
dados = pd.DataFrame(dados, rotulo, rotulo)

# print and plot
print(dados)
sns.heatmap(dados, annot=True, fmt=".2f", cmap="Blues_r", cbar=False, square=True, xticklabels='auto')
plt.show()

There’s a lot more you can do with this library. Look for its features in the documentation and on websites that cover it.

And since you say you’re starting out and it’s all new, if you want to dig deeper into natural language processing (NLP), I recommend reading more about NLP (there are several good books on it) and exploring the libraries nltk, stanfordcorenlp, gensim, and textblob.

  • Thanks, Rafael! Excellent, I will test it tomorrow. Thank you very much. I keep insisting on embeddings because I have heard that they take the context of the word into account and significantly improve the accuracy of the similarity algorithm.

  • If you are really interested in word embeddings, study Word2vec (see the sketch after these comments). The quality of that approach depends heavily on how the neural network was trained. There’s a lot of material out there, but it’s not always suited to our needs.

  • It is important to understand how the concept of word embedding works and how Word2vec and cosine similarity fit into the solution. There are plenty of articles and videos explaining how it works.
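Sketch referenced in the comments above: a minimal Word2vec example with gensim (assuming the gensim 4.x API; the tiny corpus is only a hypothetical placeholder, so the similarity values are meaningless until the model is trained on real text):

from gensim.models import Word2Vec

# hypothetical toy corpus: one list of tokens per sentence
corpus = [
    ['luz', 'claridade', 'dia'],
    ['amargo', 'salgado', 'doce'],
    ['luz', 'dia', 'sol'],
]

# train a small Word2vec model (gensim 4.x API)
model = Word2Vec(sentences=corpus, vector_size=50, window=2, min_count=1, epochs=50)

# cosine similarity between two words from the trained vocabulary
print(model.wv.similarity('luz', 'claridade'))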
