When you try to measure similarity between sentences, it is not a good idea to use character n-grams over the entire sentence, because they match many short, common character stretches and end up "finding" a close resemblance between sentences that are actually quite different.
For sentence similarity, the first step is to normalize the data (a quick sketch of these steps follows the list):
Remove accents
Remove extra whitespace
Convert everything to a single case (upper or lower)
Remove stopwords (optional; sometimes the stopwords themselves help a lot to find a pattern, but not always)
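For illustration, here is a minimal standalone sketch of the first three steps (the example string and variable names are just for demonstration; the optional stopword removal is left out here):

import re
from unicodedata import normalize

raw = '  COMERCIAL  Cásá dos Frios '
no_accents = normalize('NFKD', raw).encode('ASCII', 'ignore').decode('ASCII')  # remove accents
single_spaced = re.sub(r'\s+', ' ', no_accents).strip()                        # remove extra whitespace
lowered = single_spaced.lower()                                                # single case
print(lowered)  # -> 'comercial casa dos frios'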
Below is some code I wrote that computes the cosine similarity. I will not go deep into the theory; you can find more details about cosine similarity on stefansavev’s blog.
Basically you build a frequency vector for each sentence and then apply the similarity calculation on those frequencies.
For comparison I left a flag called use_text_bigram to show how harmful character n-grams over the whole sentence can be for the algorithm. Another point I forgot to mention: n-grams over tokens (word by word) can be important, because the relevance of one word conditioned on another can carry a very significant weight. For example, São Paulo is quite different from Senhor Paulo when using 2-grams over tokens; but if you use character n-grams over the whole sentence you get ("sa", "ao", "o ", " p", "pa", "au", "ul", "lo") compared to ("se", "en", "nh", "ho", "or", "r ", " p", "pa", "au", "ul", "lo"). In the end these two strings have a low similarity using token n-grams (around 33%) and a much higher similarity using character n-grams over the whole sentence (around 48%).
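If you want to check those character bigrams quickly, this tiny standalone snippet (separate from the full code below) reproduces them:

# Character bigrams of the two normalized strings
s1, s2 = 'sao paulo', 'senhor paulo'
b1 = [s1[i:i+2] for i in range(len(s1) - 1)]
b2 = [s2[i:i+2] for i in range(len(s2) - 1)]
print(b1)                 # ['sa', 'ao', 'o ', ' p', 'pa', 'au', 'ul', 'lo']
print(b2)                 # ['se', 'en', 'nh', 'ho', 'or', 'r ', ' p', 'pa', 'au', 'ul', 'lo']
print(set(b1) & set(b2))  # the shared bigrams are what inflates the similarity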
Here is the full code:
import nltk
import re
import math
import pandas as pd
from collections import Counter
from unicodedata import normalize
from nltk import ngrams
# Regex to find tokens
REGEX_WORD = re.compile(r'\w+')
# Number of tokens in sequence
N_GRAM_TOKEN = 3

# Normalizes the text, removing extra spaces and stripping accents
def text_normalizer(src):
    return re.sub(r'\s+', ' ',
                  normalize('NFKD', src)
                  .encode('ASCII', 'ignore')
                  .decode('ASCII')
                  ).lower().strip()

# Computes the cosine-based similarity
def cosine_similarity(vec1, vec2):
    intersection = set(vec1.keys()) & set(vec2.keys())
    numerator = sum([vec1[x] * vec2[x] for x in intersection])
    sum1 = sum([vec1[x]**2 for x in vec1.keys()])
    sum2 = sum([vec2[x]**2 for x in vec2.keys()])
    denominator = math.sqrt(sum1) * math.sqrt(sum2)
    if not denominator:
        return 0.0
    else:
        coef = float(numerator) / denominator
        if coef > 1:
            coef = 1
        return coef

# Builds the frequency vector of the sentence
def sentence_to_vector(text, use_text_bigram):
    words = REGEX_WORD.findall(text)
    accumulator = []
    # word n-grams for n = 1 .. N_GRAM_TOKEN - 1
    for n in range(1, N_GRAM_TOKEN):
        gramas = ngrams(words, n)
        for grama in gramas:
            accumulator.append(str(grama))
    # optionally add the character bigrams of the whole text
    if use_text_bigram:
        pairs = get_text_bigrams(text)
        for pair in pairs:
            accumulator.append(pair)
    return Counter(accumulator)

# Gets the similarity between two sentences
def get_sentence_similarity(sentence1, sentence2, use_text_bigram=False):
    vector1 = sentence_to_vector(text_normalizer(sentence1), use_text_bigram)
    vector2 = sentence_to_vector(text_normalizer(sentence2), use_text_bigram)
    return cosine_similarity(vector1, vector2)

# Generates the character bigrams of a string
def get_text_bigrams(src):
    s = src.lower()
    return [s[i:i+2] for i in range(len(s) - 1)]

if __name__ == "__main__":
    w1 = 'COMERCIAL CASA DOS FRIOS - USAR LICINIO DIAS'
    words = [
        'ARES DOS ANDES - EXPORTACAO & IMPORTACAO LTDA',
        'ADEGA DOS TRES IMPORTADORA',
        'BODEGAS DE LOS ANDES COMERCIO DE VINHOS LTDA',
        'ALL WINE IMPORTADORA'
    ]

    print('Search: ' + w1)
    # Acceptance level (40%)
    cutoff = 0.40
    # Similar sentences
    result = []
    for w2 in words:
        print('\nCosine Sentence --- ' + w2)
        # Cosine similarity using only tokens
        similarity_sentence = get_sentence_similarity(w1, w2)
        print('\tSimilarity sentence: ' + str(similarity_sentence))
        # Cosine similarity using tokens plus character bigrams of the text
        similarity_sentence_text_bigram = get_sentence_similarity(w1, w2, use_text_bigram=True)
        print('\tSimilarity sentence text bigram: ' + str(similarity_sentence_text_bigram))
        if similarity_sentence >= cutoff:
            result.append((w2, similarity_sentence))

    print('\nResult:')
    # Show results
    for data in result:
        print(data)
The result was as follows:

Search: COMERCIAL CASA DOS FRIOS - USAR LICINIO DIAS

Cosine Sentence --- ARES DOS ANDES - EXPORTACAO & IMPORTACAO LTDA
    Similarity sentence: 0.08362420100070908
    Similarity sentence text bigram: 0.26518576139191

Cosine Sentence --- ADEGA DOS TRES IMPORTADORA
    Similarity sentence: 0.10482848367219183
    Similarity sentence text bigram: 0.223606797749979

Cosine Sentence --- BODEGAS DE LOS ANDES COMERCIO DE VINHOS LTDA
    Similarity sentence: 0.0
    Similarity sentence text bigram: 0.39317854974639244

Cosine Sentence --- ALL WINE IMPORTADORA
    Similarity sentence: 0.0
    Similarity sentence text bigram: 0.09245003270420486
Notice that when using the text bigram the model overestimates the similarity, thinking there is a strong resemblance when there is not. This happens because several character bigrams (co, om, me, er, rc, ci, ia, al) repeat a lot across the strings. When the model uses only tokens it behaves much better and correctly says that there really is not much similarity between these strings.
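You can see this directly by listing the character bigrams the query shares with the candidate that scored highest with use_text_bigram (a small standalone check; the strings below are the normalized versions from the example):

# Character bigrams shared by the query and the best-scoring candidate
s1 = 'comercial casa dos frios - usar licinio dias'
s2 = 'bodegas de los andes comercio de vinhos ltda'
b1 = {s1[i:i+2] for i in range(len(s1) - 1)}
b2 = {s2[i:i+2] for i in range(len(s2) - 1)}
print(sorted(b1 & b2))  # includes generic bigrams such as 'co', 'om', 'me', 'er', 'os', 'as'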
To use the method you were already using to calculate similarity (strike_match), you can do:
# Computes similarity using the strike match approach
def strike_match(vec1, vec2):
    pairs1 = vec1.keys()
    pairs2 = vec2.keys()
    union = len(pairs1) + len(pairs2)
    hit_count = 0
    for x in pairs1:
        for y in pairs2:
            if x == y:
                hit_count += 1
                break
    return (2.0 * hit_count) / union
Then, in the get_sentence_similarity method, just change the line return cosine_similarity(vector1, vector2) to return strike_match(vector1, vector2).
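If you would rather keep both metrics available instead of editing that line, one possibility (just a sketch; the metric parameter is my own addition) is to pass the metric as an argument:

# Sketch: make the similarity metric selectable instead of swapping the return line
def get_sentence_similarity(sentence1, sentence2, use_text_bigram=False, metric=cosine_similarity):
    vector1 = sentence_to_vector(text_normalizer(sentence1), use_text_bigram)
    vector2 = sentence_to_vector(text_normalizer(sentence2), use_text_bigram)
    return metric(vector1, vector2)

# get_sentence_similarity(w1, w2)                       -> cosine (default)
# get_sentence_similarity(w1, w2, metric=strike_match)  -> strike match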
Editing the answer to address the other questions:
To make these replacements using a pandas DataFrame you can keep the same structure already shown, but for the output you would create the following method:
import numpy as np

# Replaces the marca of each finder row with the best-matching comparer marca
def get_dataframe_similarity(comparer, finder, cutoff):
    print('cutoff= ' + str(cutoff))
    result = []
    comparer = np.array(comparer)
    for find in np.array(finder):
        max_coef = 0
        data = find
        for compare in comparer:
            similarity = get_sentence_similarity(find[0], compare[0])
            if similarity >= cutoff:
                if similarity > max_coef:
                    print('Replacing ' + data[1] + ' with ' + compare[1])
                    print(data[0] + ' ---- ' + compare[0] + ' - similarity: ' + str(float('%g' % (similarity * 100))) + '%')
                    data[1] = compare[1]
                    max_coef = similarity
        result.append(data)
    result = np.array(result)
    dataFrame = pd.DataFrame()
    dataFrame['texto'] = result[..., 0]
    dataFrame['marca'] = result[..., 1]
    return dataFrame
It receives a comparison DataFrame and a search DataFrame, and returns the search DataFrame with the marca column replaced according to the specified cutoff.
To use it you can do it this way:
if __name__ == "__main__":
    cutoff = 0.4

    dataFrame1 = pd.DataFrame()
    dataFrame1['texto'] = ['COMERCIAL CASA DOS FRIOS - USAR LICINIO DIAS']
    dataFrame1['marca'] = ['xpto']

    dataFrame2 = pd.DataFrame()
    dataFrame2['texto'] = ['ARES DOS ANDES - EXPORTACAO & IMPORTACAO LTDA', 'ADEGA DOS TRES IMPORTADORA', 'BODEGAS DE LOS ANDES COMERCIO DE VINHOS LTDA', 'ALL WINE IMPORTADORA']
    dataFrame2['marca'] = ['marca1', 'marca2', 'marca3', 'marca4']

    dataResult = get_dataframe_similarity(comparer=dataFrame1, finder=dataFrame2, cutoff=cutoff)
    print(dataResult)
Thanks. Could you guide me on how to solve the other questions I posed? Thank you for your help!
– Bene
Sure, of course, I will edit the answer, but for now you can use this same answer structure
– brow-joe
Answer edited
– brow-joe
Dude, you really helped me out a lot, thank you so much!!
– Bene
Just one more question: I noticed that, to test the script with the modifications, you are passing each comparison string manually into dataFrame1 and dataFrame2. How can I pass this dynamically, so that row 1 of dataFrame1 is compared against the rows of dataFrame2 to find the similarity required by the cutoff? Thank you
– Bene
In the current get_dataframe_similarity method the column positions are already used, with position 0 being the texts and position 1 being the brands. Another alternative is to use ix, for example: for the texts, textos = dataFrame.ix[:, 0], and for the brands, marcas = dataFrame.ix[:, 1]
– brow-joe
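One caveat here: in newer pandas releases the .ix accessor has been deprecated and removed, so the positional equivalent today would be .iloc (same idea, different accessor; dataFrame is the same DataFrame mentioned in the comment above):

# Positional column access with .iloc instead of the deprecated .ix
textos = dataFrame.iloc[:, 0]  # first column: the texts
marcas = dataFrame.iloc[:, 1]  # second column: the brands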
Okay, thank you and congratulations!
– Bene
I tested the algorithm with the two dataframes and it is working well, but the precision is a bit tricky. When I use 60% it flags strings with little similarity as similar; with 20% the same thing happens, and I have also tried 50% and the wrong results keep appearing. How can I improve the accuracy as much as possible? Thank you!
– Bene
I like to use a cutoff score of at least 70% (still with some errors) or 80% (more accurate)
– brow-joe
The point is that with 80%, a string like 'red wine crossing, 6x75cl' compared with 'wine crossing 12x75cl, 80 boxes, concha y toro' will be reported as not being the same product, when in reality it is. Understood?
– Bene
Ah yes, I see. In that case I have two possible solutions: the first is to consider the bigrams of each token, the second is to smooth the model by removing all numbers
– brow-joe
The first solution would be something like this: in the sentence_to_vector method, right after the line accumulator.append(str(grama)), you do the following: iterate over each token with for n_gram in grama:, then for each token you get its bigrams with bigram_token = ngrams(n_gram, 2), then go through each generated bigram with for bigram in bigram_token: and add it to the accumulator with accumulator.append(str(bigram)), as in the sketch below
– brow-joe
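Putting those steps together, the modified sentence_to_vector would look roughly like this (a sketch of the change described in the comment above, not a tested drop-in):

# Sketch: sentence_to_vector extended with the character bigrams of each token
def sentence_to_vector(text, use_text_bigram):
    words = REGEX_WORD.findall(text)
    accumulator = []
    for n in range(1, N_GRAM_TOKEN):
        gramas = ngrams(words, n)
        for grama in gramas:
            accumulator.append(str(grama))
            # for each token in the word n-gram, also accumulate its character bigrams
            for n_gram in grama:
                bigram_token = ngrams(n_gram, 2)
                for bigram in bigram_token:
                    accumulator.append(str(bigram))
    if use_text_bigram:
        pairs = get_text_bigrams(text)
        for pair in pairs:
            accumulator.append(pair)
    return Counter(accumulator)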
To remove all numbers, just use re.sub(r'\d+', '', text_normalizer(sentence)) inside the get_sentence_similarity method (see the sketch below)
– brow-joe
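Applied to the method from the answer, that change would look roughly like this (a sketch, keeping the cosine metric from the original code):

# Sketch: get_sentence_similarity with digits stripped after normalization
def get_sentence_similarity(sentence1, sentence2, use_text_bigram=False):
    text1 = re.sub(r'\d+', '', text_normalizer(sentence1))
    text2 = re.sub(r'\d+', '', text_normalizer(sentence2))
    vector1 = sentence_to_vector(text1, use_text_bigram)
    vector2 = sentence_to_vector(text2, use_text_bigram)
    return cosine_similarity(vector1, vector2)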
Oops, I had already used:
re.sub('[0-9]+', '', result)
to strip out the numbers. I also created a list of stopwords that appear too often and eliminate them with a loop that checks whether the received content contains those words. It improved the accuracy (I am using 80%), but there are still some divergences.
– Bene
I put the number removal and the stopwords inside the function:
def text_normalizer(src):
– Bene
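A text_normalizer modified along those lines could look roughly like this (just a sketch; the stopword list below is purely illustrative, not the one actually used):

# Sketch: text_normalizer with digit removal and a custom stopword list
STOPWORDS = {'de', 'dos', 'das', 'ltda'}  # illustrative list only

def text_normalizer(src):
    text = re.sub(r'\s+', ' ',
                  normalize('NFKD', src)
                  .encode('ASCII', 'ignore')
                  .decode('ASCII')
                  ).lower().strip()
    text = re.sub(r'\d+', '', text)  # remove all numbers
    return ' '.join(w for w in text.split() if w not in STOPWORDS)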
I am getting an error: TypeError: normalize() argument 2 must be str, not float in the function def text_normalizer(src). I have already checked the files and the error persists.
– Bene