Help with Strike Match Similarity Algorithm

Friends, I need help with the algorithm below, which looks for similarities between strings:

import nltk 
import pandas as pd

def get_bigrams(string):
    s = string.lower()
    return [s[i:i+2] for i in range(len(s) - 1)]

def string_similarity(str1, str2):
    pairs1 = get_bigrams(str1)
    pairs2 = get_bigrams(str2)
    union  = len(pairs1) + len(pairs2)
    hit_count = 0
    for x in pairs1:
        for y in pairs2:
            if x == y:
                hit_count += 1
                break
    return (2.0 * hit_count) / union

if __name__ == "__main__":
    w1 = 'COMERCIAL CASA DOS FRIOS - USAR LICINIO DIAS'
    words = ['ARES DOS ANDES - EXPORTACAO & IMPORTACAO LTDA', 'ADEGA DOS TRES IMPORTADORA', 'BODEGAS DE LOS ANDES COMERCIO DE VINHOS LTDA', 'ALL WINE IMPORTADORA']

    for w2 in words:
        print('Result --- ' + w2)
        print(string_similarity(w1, w2))

When I run this script, comparing w1 against each entry in words, I get the similarity percentages below:

Result --- ARES DOS ANDES - EXPORTACAO & IMPORTACAO LTDA
0.2988505747126437
Result --- ADEGA DOS TRES IMPORTADORA
0.23529411764705882
Result --- BODEGAS DE LOS ANDES COMERCIO DE VINHOS LTDA
0.4883720930232558
Result --- ALL WINE IMPORTADORA
0.12903225806451613

I am getting almost 49% similarity in the third comparison, even though the texts have almost nothing in common.

I need help with the following:

  1. I need the get_bigrams function to receive two columns from different dataframes (dataframe1 and dataframe2) and compare them with each other (currently it receives strings).
  2. I need to improve the similarity acceptance level, perhaps using NLTK to remove stopwords, spaces, accents and so on, but I do not know how, nor how to integrate all of this.
  3. I need to add a rule so that when I get a similarity level of 35-40%, the corresponding value of another column (called tag) from one of the dataframes is copied to the other.

    Ex: dataframe1: ababababababa, brand=empty; dataframe2: ababababab, brand=xxxxx

If the example above has more than 35% similarity, then copy the dataframe2 tag to the dataframe1 tag, doing this for each compared row.

NOTE: I am a beginner in Python, but I have been studying a lot.

1 answer

When you try to find similarity between sentences, it is not a good idea to use n-grams over the entire sentence, because the algorithm will match many similar short passages and "find" a close resemblance between the two sentences that is not really there.

For sentence similarity, the first step is to normalize the data:

Remove accents

Remove extra spacing

Use a single case (all uppercase or all lowercase)

Remove stopwords (optional; sometimes the stopwords themselves help a lot in finding a pattern, but not always)

Below is code that calculates the cosine similarity; I will not go deep into the theory here. You can find more details about cosine similarity on stefansavev's blog.

Basically, you need to build the frequency vector of each sentence and then apply the similarity calculation to these frequencies: the cosine similarity is the dot product of the two vectors divided by the product of their magnitudes.
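
For instance, the frequency vector is just a count of each token (and, optionally, each n-gram) in the sentence; a minimal illustration using collections.Counter:

from collections import Counter

#Frequency vector of the tokens of a normalized sentence
print(Counter('adega dos tres importadora'.split()))
#Counter({'adega': 1, 'dos': 1, 'tres': 1, 'importadora': 1})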

For comparison, I left a variable called use_text_bigram to show how harmful character n-grams over the whole sentence can be to the algorithm. Another point worth mentioning: using n-grams over tokens (word by word) can be important, because the relevance of one word conditioned on another can carry significant weight. For example, São Paulo is quite different from Senhor Paulo when using token bigrams; if instead you use character bigrams over the whole sentence, you get ('sa', 'ao', 'o ', ' p', 'pa', 'au', 'ul', 'lo') compared to ('se', 'en', 'nh', 'ho', 'or', 'r ', ' p', 'pa', 'au', 'ul', 'lo'). In the end, these two sentences have a low similarity using token n-grams (around 33%) and a much higher similarity using character bigrams over the whole sentence (around 48%).
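
A quick sketch of that difference (a minimal, hypothetical example; it assumes nltk is installed and the sentences are already normalized to unaccented lowercase):

from nltk import ngrams

s1, s2 = 'sao paulo', 'senhor paulo'

#Token bigrams: pairs of consecutive words
print(list(ngrams(s1.split(), 2)))  #[('sao', 'paulo')]
print(list(ngrams(s2.split(), 2)))  #[('senhor', 'paulo')]

#Character bigrams over the whole sentence
print([s1[i:i+2] for i in range(len(s1) - 1)])
print([s2[i:i+2] for i in range(len(s2) - 1)])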

Follows the code:

import nltk 
import re
import math

import pandas as pd

from collections import Counter
from unicodedata import normalize
from nltk import ngrams

#Regex to find tokens
REGEX_WORD = re.compile(r'\w+')
#Number of tokens in sequence
N_GRAM_TOKEN = 3

#Normalizes the text, removing extra spaces and stripping accents
def text_normalizer(src):
    return re.sub(r'\s+', ' ',
                normalize('NFKD', src)
                   .encode('ASCII','ignore')
                   .decode('ASCII')
           ).lower().strip()

#Calculates the cosine-based similarity
def cosine_similarity(vec1, vec2):
    intersection = set(vec1.keys()) & set(vec2.keys())
    numerator = sum([vec1[x] * vec2[x] for x in intersection])

    sum1 = sum([vec1[x]**2 for x in vec1.keys()])
    sum2 = sum([vec2[x]**2 for x in vec2.keys()])
    denominator = math.sqrt(sum1) * math.sqrt(sum2)

    if not denominator:
        return 0.0
    else:
        coef = float(numerator) / denominator
        if coef > 1:
            coef = 1
        return coef

#Builds the frequency vector of the sentence
def sentence_to_vector(text, use_text_bigram):
    words = REGEX_WORD.findall(text)
    accumulator = []
    for n in range(1,N_GRAM_TOKEN):
        gramas = ngrams(words, n)
        for grama in gramas:
            accumulator.append(str(grama))

    if use_text_bigram:
        pairs = get_text_bigrams(text)
        for pair in pairs:
            accumulator.append(pair)

    return Counter(accumulator)

#Gets the similarity between two sentences
def get_sentence_similarity(sentence1, sentence2, use_text_bigram=False):
    vector1 = sentence_to_vector(text_normalizer(sentence1), use_text_bigram)
    vector2 = sentence_to_vector(text_normalizer(sentence2), use_text_bigram)
    return cosine_similarity(vector1, vector2)

#Method that generates the bigrams of a string
def get_text_bigrams(src):
    s = src.lower()
    return [s[i:i+2] for i in range(len(s) - 1)]

if __name__ == "__main__":
    w1 = 'COMERCIAL CASA DOS FRIOS - USAR LICINIO DIAS'
    words = [
        'ARES DOS ANDES - EXPORTACAO & IMPORTACAO LTDA', 
        'ADEGA DOS TRES IMPORTADORA', 
        'BODEGAS DE LOS ANDES COMERCIO DE VINHOS LTDA', 
        'ALL WINE IMPORTADORA'
    ]

    print('Search: ' + w1)

    #Acceptance level (40%)
    cutoff = 0.40
    #Similar sentences
    result = []

    for w2 in words:
        print('\nCosine Sentence --- ' + w2)

        #Cosine similarity calculation using tokens only
        similarity_sentence = get_sentence_similarity(w1, w2)
        print('\tSimilarity sentence: ' + str(similarity_sentence))

        #Cosine similarity calculation using tokens plus text bigrams
        similarity_sentence_text_bigram = get_sentence_similarity(w1, w2, use_text_bigram=True)
        print('\tSimilarity sentence text bigram: ' + str(similarity_sentence_text_bigram))

        if similarity_sentence >= cutoff:
            result.append((w2, similarity_sentence))

    print('\nResult:')
    #Display results
    for data in result:
        print(data)

The result was as follows:

Search: COMERCIAL CASA DOS FRIOS - USAR LICINIO DIAS

Cosine Sentence --- ARES DOS ANDES - EXPORTACAO & IMPORTACAO LTDA

    Similarity sentence: 0.08362420100070908

    Similarity sentence text bigram: 0.26518576139191

Cosine Sentence --- ADEGA DOS TRES IMPORTADORA

    Similarity sentence: 0.10482848367219183

    Similarity sentence text bigram: 0.223606797749979

Cosine Sentence --- BODEGAS DE LOS ANDES COMERCIO DE VINHOS LTDA

    Similarity sentence: 0.0

    Similarity sentence text bigram: 0.39317854974639244

Cosine Sentence --- ALL WINE IMPORTADORA

    Similarity sentence: 0.0

    Similarity sentence text bigram: 0.09245003270420486

Note that with the text bigram, the model overestimates, thinking there is a strong resemblance when there is not; this is explained by the fact that several bigrams (co, om, me, er, rc, ci, ia, al) repeat a lot. When the model uses only tokens, it converges much better, correctly indicating that there really is not much similarity between these data.

If you want to keep the method you were already using to calculate similarity (strike match), you can:

#Calculates the strike-match-based similarity
def strike_match(vec1, vec2):
    pairs1 = vec1.keys()
    pairs2 = vec2.keys()
    union  = len(pairs1) + len(pairs2)
    hit_count = 0
    for x in pairs1:
        for y in pairs2:
            if x == y:
                hit_count += 1
                break
    return (2.0 * hit_count) / union

Then, in the get_sentence_similarity method, just change the line return cosine_similarity(vector1, vector2) to return strike_match(vector1, vector2), like this:
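
A minimal sketch of that change (the name get_sentence_similarity_strike is hypothetical, just to show both variants side by side; you can simply edit the original function in place):

#Same pipeline as get_sentence_similarity, but scored with strike_match
def get_sentence_similarity_strike(sentence1, sentence2, use_text_bigram=False):
    vector1 = sentence_to_vector(text_normalizer(sentence1), use_text_bigram)
    vector2 = sentence_to_vector(text_normalizer(sentence2), use_text_bigram)
    return strike_match(vector1, vector2)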

Editing the answer to address the other questions:

To do these swaps using the pandas dataframes, you can keep the same structure already shown, but for the output I would create the following method:

import numpy as np

def get_dataframe_similarity(comparer, finder, cutoff):
    print('cutoff= ' + str(cutoff))
    result = []
    comparer = np.array(comparer)
    for find in np.array(finder):
        max_coef = 0
        data = find
        for compare in comparer:
            similarity = get_sentence_similarity(find[0], compare[0])
            if similarity >= cutoff:
                if similarity > max_coef:
                    print('Swapping ' + data[1] + ' for ' + compare[1])
                    print(data[0] + ' ---- ' + compare[0] + ' - similarity: ' + str(float('%g' % (similarity * 100))) + '%')
                    data[1] = compare[1]
                    max_coef = similarity
        result.append(data)

    result = np.array(result)
    dataFrame = pd.DataFrame()
    dataFrame['texto'] = result[..., 0]
    dataFrame['marca'] = result[..., 1]
    return dataFrame

It receives a comparison dataframe and a search dataframe, and returns the search dataframe with the brand changes applied according to the specified cutoff.

To use it, you can do it this way:

if __name__ == "__main__":
    cutoff = 0.4
    dataFrame1 = pd.DataFrame()
    dataFrame1['texto'] = ['COMERCIAL CASA DOS FRIOS - USAR LICINIO DIAS']
    dataFrame1['marca'] = ['xpto']

    dataFrame2 = pd.DataFrame()
    dataFrame2['texto'] = ['ARES DOS ANDES - EXPORTACAO & IMPORTACAO LTDA', 'ADEGA DOS TRES IMPORTADORA', 'BODEGAS DE LOS ANDES COMERCIO DE VINHOS LTDA', 'ALL WINE IMPORTADORA']
    dataFrame2['marca'] = ['marca1', 'marca2', 'marca3', 'marca4']

    dataResult = get_dataframe_similarity(comparer=dataFrame1, finder=dataFrame2, cutoff=cutoff)
    print(dataResult)
  • Thanks. Could you guide me on how to solve the other questions I posed? Thank you for your help!

  • Sure, of course, I'll edit the answer, but for now you can use this same answer structure.

  • I have edited the answer.

  • Dude, you really helped me out, thank you so much!!

  • Just one more question: I noticed that to test the script with the modifications, you are passing each comparison string manually in dataFrame1 and dataFrame2. How can I pass this dynamically, so that row 1 of dataFrame1 is compared against the rows of dataFrame2 to find the similarity required by the cutoff? Thank you.

  • The current get_dataframe_similarity method already takes the column positions, with position 0 being the texts and position 1 being the brands. Another alternative is to use iloc, for example: textos = dataFrame.iloc[:, 0] for the texts and marcas = dataFrame.iloc[:, 1] for the brands.

  • Okay, thank you and congratulations!

  • I tested the algorithm with the two dataframes and it is working well, but the precision part is somewhat complicated. When I use 60% it marks barely similar strings as matches; with 20% the same thing happens, and I have also tried 50% and wrong results keep appearing. How can I improve this accuracy as much as possible? Thank you!

  • I like to use a cutoff score of at least 70% (still with some errors) or 80% (more accurate).

  • The point is that with 80%, a string like red wine crossing, 6x75cl compared with wine crossing 12x75cl, 80 boxes, concha y toro will be reported as not being the same product, when in reality it is. Understood?

  • Ah yes, understood. In that case I see two possible solutions: the first is to consider the bigrams of each token; the second is to smooth the model by removing all the numbers.

  • The first solution would be something like this: in the sentence_to_vector method, right after the line accumulator.append(str(grama)), iterate over each token with for n_gram in grama:; then, for each token, retrieve the bigrams with bigram_token = ngrams(n_gram, 2); then go through each generated bigram with for bigram in bigram_token: and add it to the accumulator with accumulator.append(str(bigram)). See the first sketch after these comments.

  • To remove all the numbers, just use re.sub(r'\d+', '', text_normalizer(sentence)) inside the get_sentence_similarity method.

  • Oops, I had already used re.sub('[0-9]+', '', result) to strip out the numbers. I also created a list of stopwords that appear too often and eliminate them with a loop that checks whether the received content contains those words. That improved accuracy (I'm using 80%), but there are still some divergences.

  • I put the number removal and the stopwords inside the function def text_normalizer(src): (see the second sketch after these comments).

  • I'm getting an error: TypeError: normalize() argument 2 must be str, not float in the function def text_normalizer(src). I have already checked the files and the error persists.

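For reference, a minimal sketch of the token-bigram suggestion from the comments, written as a modification of sentence_to_vector from the answer (only the inner loop is new):

#Builds the frequency vector of the sentence, also counting the character
#bigrams of each token, as suggested in the comments
def sentence_to_vector(text, use_text_bigram):
    words = REGEX_WORD.findall(text)
    accumulator = []
    for n in range(1, N_GRAM_TOKEN):
        gramas = ngrams(words, n)
        for grama in gramas:
            accumulator.append(str(grama))
            #New part: character bigrams of each token in the n-gram
            for n_gram in grama:
                bigram_token = ngrams(n_gram, 2)
                for bigram in bigram_token:
                    accumulator.append(str(bigram))

    if use_text_bigram:
        pairs = get_text_bigrams(text)
        for pair in pairs:
            accumulator.append(pair)

    return Counter(accumulator)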
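
And a sketch of the extended text_normalizer discussed in the last comments. The stopword list is purely illustrative; note that wrapping src in str() also sidesteps the TypeError from the final comment, since pandas represents empty cells as float NaN:

#Illustrative stopwords; replace with the words that repeat too often in your data
STOPWORDS = {'ltda', 'dos', 'de'}

#Normalizes the text: strips accents, numbers, stopwords and extra spaces
def text_normalizer(src):
    text = normalize('NFKD', str(src)).encode('ASCII', 'ignore').decode('ASCII').lower()
    #Remove all numbers
    text = re.sub(r'\d+', ' ', text)
    #Remove stopwords and collapse the remaining tokens back into a sentence
    words = [w for w in REGEX_WORD.findall(text) if w not in STOPWORDS]
    return ' '.join(words).strip()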
