Help with Strike Match Similarity Algorithm

Friends, I need help with the algorithm below, which looks for similarities between strings:

import nltk 
import pandas as pd

def get_bigrams(string):
    s = string.lower()
    return [s[i:i+2] for i in range(len(s) - 1)]

def string_similarity(str1, str2):
    pairs1 = get_bigrams(str1)
    pairs2 = get_bigrams(str2)
    union  = len(pairs1) + len(pairs2)
    hit_count = 0
    for x in pairs1:
        for y in pairs2:
            if x == y:
                hit_count += 1
                break
    return (2.0 * hit_count) / union

if __name__ == "__main__":
    w1 = 'COMERCIAL CASA DOS FRIOS - USAR LICINIO DIAS'
    words = ['ARES DOS ANDES - EXPORTACAO & IMPORTACAO LTDA', 'ADEGA DOS TRES IMPORTADORA', 'BODEGAS DE LOS ANDES COMERCIO DE VINHOS LTDA', 'ALL WINE IMPORTADORA']

    for w2 in words:
        print('Result --- ' + w2)
        print(string_similarity(w1, w2))

When I run this script, comparing w1 against each entry in words, I get the similarity percentages below:

Result --- ARES DOS ANDES - EXPORTACAO & IMPORTACAO LTDA
0.2988505747126437
Result --- ADEGA DOS TRES IMPORTADORA
0.23529411764705882
Result --- BODEGAS DE LOS ANDES COMERCIO DE VINHOS LTDA
0.4883720930232558
Result --- ALL WINE IMPORTADORA
0.12903225806451613

I am getting almost 49% similarity in the third comparison, even though the texts have almost nothing in common.

I need help with the following:

  1. I need the get_bigrams function to receive two columns from different dataframes (dataframe1 and dataframe2) and compare them with each other (currently it receives strings).
  2. I need to improve the similarity acceptance level, perhaps using NLTK to remove stopwords, spaces, accents and so on, but I do not know how, nor how to integrate all of this.
  3. I need to add a rule so that when I get a similarity level of 35-40%, the corresponding value of another column (called tag) from one of the dataframes is copied to the other.

    Ex: dataframe1: ababababababa, brand=empty; dataframe2: ababababab, brand=xxxxx

If the example above has more than 35% similarity, then copy the dataframe2 tag to the dataframe1 tag, doing this for each compared row.

NOTE: I am a beginner in Python, but I have been studying a lot.

1 answer

When you try to find similarity between sentences, it is not a good idea to use n-grams over the entire sentence, because the algorithm will match many similar short passages and "find" a close resemblance between the two sentences that is not really there.

For sentence similarity, the first step is to normalize the data:

Remove accents

Remove extra spacing

Use a single case (all uppercase or all lowercase)

Remove stopwords (optional; sometimes the stopwords themselves help a lot in finding a pattern, but not always)

Below is code that calculates the cosine similarity; I will not go deep into the theory here. You can find more details about cosine similarity on stefansavev's blog.

Basically, you need to build the frequency vector of each sentence and then apply the similarity calculation to these frequencies: the cosine similarity is the dot product of the two vectors divided by the product of their magnitudes.
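
For instance, the frequency vector is just a count of each token (and, optionally, each n-gram) in the sentence; a minimal illustration using collections.Counter:

from collections import Counter

#Frequency vector of the tokens of a normalized sentence
print(Counter('adega dos tres importadora'.split()))
#Counter({'adega': 1, 'dos': 1, 'tres': 1, 'importadora': 1})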

For comparison, I left a variable called use_text_bigram to show how harmful character n-grams over the whole sentence can be to the algorithm. Another point worth mentioning: using n-grams over tokens (word by word) can be important, because the relevance of one word conditioned on another can carry significant weight. For example, São Paulo is quite different from Senhor Paulo when using token bigrams; if instead you use character bigrams over the whole sentence, you get ('sa', 'ao', 'o ', ' p', 'pa', 'au', 'ul', 'lo') compared to ('se', 'en', 'nh', 'ho', 'or', 'r ', ' p', 'pa', 'au', 'ul', 'lo'). In the end, these two sentences have a low similarity using token n-grams (around 33%) and a much higher similarity using character bigrams over the whole sentence (around 48%).
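
A quick sketch of that difference (a minimal, hypothetical example; it assumes nltk is installed and the sentences are already normalized to unaccented lowercase):

from nltk import ngrams

s1, s2 = 'sao paulo', 'senhor paulo'

#Token bigrams: pairs of consecutive words
print(list(ngrams(s1.split(), 2)))  #[('sao', 'paulo')]
print(list(ngrams(s2.split(), 2)))  #[('senhor', 'paulo')]

#Character bigrams over the whole sentence
print([s1[i:i+2] for i in range(len(s1) - 1)])
print([s2[i:i+2] for i in range(len(s2) - 1)])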

Follows the code:

import nltk 
import re
import math

import pandas as pd

from collections import Counter
from unicodedata import normalize
from nltk import ngrams

#Regex to find tokens
REGEX_WORD = re.compile(r'\w+')
#Number of tokens in sequence
N_GRAM_TOKEN = 3

#Normalizes the text, removing extra spaces and stripping accents
def text_normalizer(src):
    return re.sub(r'\s+', ' ',
                normalize('NFKD', src)
                   .encode('ASCII','ignore')
                   .decode('ASCII')
           ).lower().strip()

#Calculates the cosine-based similarity
def cosine_similarity(vec1, vec2):
    intersection = set(vec1.keys()) & set(vec2.keys())
    numerator = sum([vec1[x] * vec2[x] for x in intersection])

    sum1 = sum([vec1[x]**2 for x in vec1.keys()])
    sum2 = sum([vec2[x]**2 for x in vec2.keys()])
    denominator = math.sqrt(sum1) * math.sqrt(sum2)

    if not denominator:
        return 0.0
    else:
        coef = float(numerator) / denominator
        if coef > 1:
            coef = 1
        return coef

#Builds the frequency vector of the sentence
def sentence_to_vector(text, use_text_bigram):
    words = REGEX_WORD.findall(text)
    accumulator = []
    for n in range(1,N_GRAM_TOKEN):
        gramas = ngrams(words, n)
        for grama in gramas:
            accumulator.append(str(grama))

    if use_text_bigram:
        pairs = get_text_bigrams(text)
        for pair in pairs:
            accumulator.append(pair)

    return Counter(accumulator)

#Gets the similarity between two sentences
def get_sentence_similarity(sentence1, sentence2, use_text_bigram=False):
    vector1 = sentence_to_vector(text_normalizer(sentence1), use_text_bigram)
    vector2 = sentence_to_vector(text_normalizer(sentence2), use_text_bigram)
    return cosine_similarity(vector1, vector2)

#Method that generates the bigrams of a string
def get_text_bigrams(src):
    s = src.lower()
    return [s[i:i+2] for i in range(len(s) - 1)]

if __name__ == "__main__":
    w1 = 'COMERCIAL CASA DOS FRIOS - USAR LICINIO DIAS'
    words = [
        'ARES DOS ANDES - EXPORTACAO & IMPORTACAO LTDA', 
        'ADEGA DOS TRES IMPORTADORA', 
        'BODEGAS DE LOS ANDES COMERCIO DE VINHOS LTDA', 
        'ALL WINE IMPORTADORA'
    ]

    print('Search: ' + w1)

    #Acceptance level (40%)
    cutoff = 0.40
    #Similar sentences
    result = []

    for w2 in words:
        print('\nCosine Sentence --- ' + w2)

        #Cosine similarity calculation using tokens only
        similarity_sentence = get_sentence_similarity(w1, w2)
        print('\tSimilarity sentence: ' + str(similarity_sentence))

        #Cosine similarity calculation using tokens plus text bigrams
        similarity_sentence_text_bigram = get_sentence_similarity(w1, w2, use_text_bigram=True)
        print('\tSimilarity sentence text bigram: ' + str(similarity_sentence_text_bigram))

        if similarity_sentence >= cutoff:
            result.append((w2, similarity_sentence))

    print('\nResult:')
    #Display results
    for data in result:
        print(data)

The result was as follows:

Search: COMERCIAL CASA DOS FRIOS - USAR LICINIO DIAS

Cosine Sentence --- ARES DOS ANDES - EXPORTACAO & IMPORTACAO LTDA

    Similarity sentence: 0.08362420100070908

    Similarity sentence text bigram: 0.26518576139191

Cosine Sentence --- ADEGA DOS TRES IMPORTADORA

    Similarity sentence: 0.10482848367219183

    Similarity sentence text bigram: 0.223606797749979

Cosine Sentence --- BODEGAS DE LOS ANDES COMERCIO DE VINHOS LTDA

    Similarity sentence: 0.0

    Similarity sentence text bigram: 0.39317854974639244

Cosine Sentence --- ALL WINE IMPORTADORA

    Similarity sentence: 0.0

    Similarity sentence text bigram: 0.09245003270420486

Note that with the text bigram, the model overestimates, thinking there is a strong resemblance when there is not; this is explained by the fact that several bigrams (co, om, me, er, rc, ci, ia, al) repeat a lot. When the model uses only tokens, it converges much better, correctly indicating that there really is not much similarity between these data.

If you want to keep the method you were already using to calculate similarity (strike match), you can:

#Calculates the strike-match-based similarity
def strike_match(vec1, vec2):
    pairs1 = vec1.keys()
    pairs2 = vec2.keys()
    union  = len(pairs1) + len(pairs2)
    hit_count = 0
    for x in pairs1:
        for y in pairs2:
            if x == y:
                hit_count += 1
                break
    return (2.0 * hit_count) / union

Then, in the get_sentence_similarity method, just change the line return cosine_similarity(vector1, vector2) to return strike_match(vector1, vector2), like this:
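
A minimal sketch of that change (the name get_sentence_similarity_strike is hypothetical, just to show both variants side by side; you can simply edit the original function in place):

#Same pipeline as get_sentence_similarity, but scored with strike_match
def get_sentence_similarity_strike(sentence1, sentence2, use_text_bigram=False):
    vector1 = sentence_to_vector(text_normalizer(sentence1), use_text_bigram)
    vector2 = sentence_to_vector(text_normalizer(sentence2), use_text_bigram)
    return strike_match(vector1, vector2)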

Editing the answer to address the other questions:

To do these swaps using the pandas dataframes, you can keep the same structure already shown, but for the output I would create the following method:

import numpy as np

def get_dataframe_similarity(comparer, finder, cutoff):
    print('cutoff= ' + str(cutoff))
    result = []
    comparer = np.array(comparer)
    for find in np.array(finder):
        max_coef = 0
        data = find
        for compare in comparer:
            similarity = get_sentence_similarity(find[0], compare[0])
            if similarity >= cutoff:
                if similarity > max_coef:
                    print('Swapping ' + data[1] + ' for ' + compare[1])
                    print(data[0] + ' ---- ' + compare[0] + ' - similarity: ' + str(float('%g' % (similarity * 100))) + '%')
                    data[1] = compare[1]
                    max_coef = similarity
        result.append(data)

    result = np.array(result)
    dataFrame = pd.DataFrame()
    dataFrame['texto'] = result[..., 0]
    dataFrame['marca'] = result[..., 1]
    return dataFrame

It receives a comparison dataframe and a search dataframe, and returns the search dataframe with the brand changes applied according to the specified cutoff.

To use it, you can do it this way:

if __name__ == "__main__":
    cutoff = 0.4
    dataFrame1 = pd.DataFrame()
    dataFrame1['texto'] = ['COMERCIAL CASA DOS FRIOS - USAR LICINIO DIAS']
    dataFrame1['marca'] = ['xpto']

    dataFrame2 = pd.DataFrame()
    dataFrame2['texto'] = ['ARES DOS ANDES - EXPORTACAO & IMPORTACAO LTDA', 'ADEGA DOS TRES IMPORTADORA', 'BODEGAS DE LOS ANDES COMERCIO DE VINHOS LTDA', 'ALL WINE IMPORTADORA']
    dataFrame2['marca'] = ['marca1', 'marca2', 'marca3', 'marca4']

    dataResult = get_dataframe_similarity(comparer=dataFrame1, finder=dataFrame2, cutoff=cutoff)
    print(dataResult)
  • Thanks. Could you guide me on how to solve the other questions I posed? Thank you for your help!

  • Sure, of course, I'll edit the answer, but for now you can use this same answer structure.

  • I have edited the answer.

  • Dude, you really helped me out, thank you so much!!

  • Just one more question: I noticed that to test the script with the modifications, you are passing each comparison string manually in dataFrame1 and dataFrame2. How can I pass this dynamically, so that row 1 of dataFrame1 is compared against the rows of dataFrame2 to find the similarity required by the cutoff? Thank you.

  • The current get_dataframe_similarity method already takes the column positions, with position 0 being the texts and position 1 being the brands. Another alternative is to use iloc, for example: textos = dataFrame.iloc[:, 0] for the texts and marcas = dataFrame.iloc[:, 1] for the brands.

  • Okay, thank you and congratulations!

  • I tested the algorithm with the two dataframes and it is working well, but the precision part is somewhat complicated. When I use 60% it marks barely similar strings as matches; with 20% the same thing happens, and I have also tried 50% and wrong results keep appearing. How can I improve this accuracy as much as possible? Thank you!

  • I like to use a cutoff score of at least 70% (still with some errors) or 80% (more accurate).

  • The point is that with 80%, a string like red wine crossing, 6x75cl compared with wine crossing 12x75cl, 80 boxes, concha y toro will be reported as not being the same product, when in reality it is. Understood?

  • Ah yes, understood. In that case I see two possible solutions: the first is to consider the bigrams of each token; the second is to smooth the model by removing all the numbers.

  • The first solution would be something like this: in the sentence_to_vector method, right after the line accumulator.append(str(grama)), iterate over each token with for n_gram in grama:; then, for each token, retrieve the bigrams with bigram_token = ngrams(n_gram, 2); then go through each generated bigram with for bigram in bigram_token: and add it to the accumulator with accumulator.append(str(bigram)). See the first sketch after these comments.

  • To remove all the numbers, just use re.sub(r'\d+', '', text_normalizer(sentence)) inside the get_sentence_similarity method.

  • Oops, I had already used re.sub('[0-9]+', '', result) to strip out the numbers. I also created a list of stopwords that appear too often and eliminate them with a loop that checks whether the received content contains those words. That improved accuracy (I'm using 80%), but there are still some divergences.

  • I put the number removal and the stopwords inside the function def text_normalizer(src): (see the second sketch after these comments).

  • I'm getting an error: TypeError: normalize() argument 2 must be str, not float in the function def text_normalizer(src). I have already checked the files and the error persists.

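For reference, a minimal sketch of the token-bigram suggestion from the comments, written as a modification of sentence_to_vector from the answer (only the inner loop is new):

#Builds the frequency vector of the sentence, also counting the character
#bigrams of each token, as suggested in the comments
def sentence_to_vector(text, use_text_bigram):
    words = REGEX_WORD.findall(text)
    accumulator = []
    for n in range(1, N_GRAM_TOKEN):
        gramas = ngrams(words, n)
        for grama in gramas:
            accumulator.append(str(grama))
            #New part: character bigrams of each token in the n-gram
            for n_gram in grama:
                bigram_token = ngrams(n_gram, 2)
                for bigram in bigram_token:
                    accumulator.append(str(bigram))

    if use_text_bigram:
        pairs = get_text_bigrams(text)
        for pair in pairs:
            accumulator.append(pair)

    return Counter(accumulator)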
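
And a sketch of the extended text_normalizer discussed in the last comments. The stopword list is purely illustrative; note that wrapping src in str() also sidesteps the TypeError from the final comment, since pandas represents empty cells as float NaN:

#Illustrative stopwords; replace with the words that repeat too often in your data
STOPWORDS = {'ltda', 'dos', 'de'}

#Normalizes the text: strips accents, numbers, stopwords and extra spaces
def text_normalizer(src):
    text = normalize('NFKD', str(src)).encode('ASCII', 'ignore').decode('ASCII').lower()
    #Remove all numbers
    text = re.sub(r'\d+', ' ', text)
    #Remove stopwords and collapse the remaining tokens back into a sentence
    words = [w for w in REGEX_WORD.findall(text) if w not in STOPWORDS]
    return ' '.join(words).strip()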
