How to check whether the string contents of two data frame columns are similar

3

I have a data frame in which I need to measure how similar the contents of two columns are.

For example: column a = “José Luiz da Silva” and column b = “José L. Silva”. How can I indicate that column a and column b are similar?

2 answers

1

Here’s a possible solution, in Python:

from unidecode import unidecode

ignore_list = ['de', 'do', 'da', 'dos', 'das']

def parse_name(full_name):
    name_list = full_name.split()  # Split the full name into words
    new_name_list = []
    for name in name_list:  # Go through each word
        name = name.strip('.')  # Remove periods
        name = name.lower()  # Convert all letters to lowercase
        if not name or name in ignore_list:  # Skip empty words and prepositions
            continue
        name = unidecode(name)  # Remove accents (requires the 'unidecode' library)
        new_name_list.append(name)
    return new_name_list

def is_similar(a, b):
    a = parse_name(a)
    b = parse_name(b)
    if len(a) != len(b):  # If the number of words differs, return False
        return False
    for x, y in zip(a, b):
        if (len(x) == 1) or (len(y) == 1):  # If either word is a single letter...
            if x[0] != y[0]:  # ...compare only the first letter
                return False
        else:  # Otherwise...
            if x != y:  # ...compare the whole word
                return False
    return True  # If every word matched, return True

Example usage:

a = 'José Luiz da Silva'
b = 'José L. Silva'
print(is_similar(a, b))  # Prints True

In this solution, is_similar() returns only True or False. Depending on your needs, a more flexible metric that returns a distance measure may be more useful. For example:

  • Names like 'José L. Silva' and 'José Luiz da Silva' would have distance 0 (would be considered equal);
  • Names like 'José Silva' and 'José Luiz da Silva' would have a small distance value (would be considered similar);
  • Names like 'José Silva' and 'Maria Souza' would have a large distance value (would be considered quite different).
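
One way such a distance could look is sketched below. This is only an illustration, not part of the solution above: the names normalize(), tokens_match() and name_distance() are hypothetical, the metric (the share of words left unmatched) is an assumption, and accents are stripped with the standard-library unicodedata module instead of unidecode so the sketch has no external dependencies.

```python
import unicodedata

IGNORE = {'de', 'do', 'da', 'dos', 'das'}

def normalize(full_name):
    """Lowercase, strip periods and accents, drop prepositions."""
    tokens = []
    for token in full_name.split():
        token = token.strip('.').lower()
        # NFKD splits an accented letter into base letter + combining mark
        token = ''.join(c for c in unicodedata.normalize('NFKD', token)
                        if not unicodedata.combining(c))
        if token and token not in IGNORE:
            tokens.append(token)
    return tokens

def tokens_match(x, y):
    """An initial matches any word starting with it; otherwise require equality."""
    if len(x) == 1 or len(y) == 1:
        return x[0] == y[0]
    return x == y

def name_distance(a, b):
    """Hypothetical metric: fraction of words left unmatched (0.0 to 1.0)."""
    a, b = normalize(a), normalize(b)
    if not a or not b:
        return 0.0 if a == b else 1.0
    shorter, longer = sorted((a, b), key=len)
    remaining = list(longer)
    matched = 0
    for x in shorter:
        for y in remaining:
            if tokens_match(x, y):
                remaining.remove(y)  # each word can be matched only once
                matched += 1
                break
    return 1 - matched / len(longer)

print(name_distance('José L. Silva', 'José Luiz da Silva'))  # 0.0
print(name_distance('José Silva', 'José Luiz da Silva'))     # about 0.33
print(name_distance('José Silva', 'Maria Souza'))            # 1.0
```

With this sketch the three cases above come out as equal, similar and quite different, respectively. A threshold on the returned value would then decide what counts as "similar".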

1

(TL;DR)

Testing the rate of similarity between two strings:

# Testing the similarity ratio
from difflib import SequenceMatcher
def sml(x,y):
    return SequenceMatcher(None, x, y).ratio()

x = 'José Luiz da Silva'
y = 'José L. Silva'
msg = "Similarity ratio "

print(msg, 'between x and y: ', sml(x,y) )
print(msg, 'between x and x: ', sml(x,x) )

Output:

Similarity ratio  between x and y:  0.7741935483870968
Similarity ratio  between x and x:  1.0
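
To apply the same idea to two columns, the ratio can be computed row by row and compared against a cutoff. The sketch below is an assumption-laden illustration: column_a and column_b are hypothetical sample lists standing in for the data frame columns, and the 0.7 threshold is arbitrary. With pandas, the same function could be applied row-wise via DataFrame.apply(..., axis=1).

```python
from difflib import SequenceMatcher

def similarity(x, y):
    """Similarity ratio between two strings, from 0.0 to 1.0."""
    return SequenceMatcher(None, x, y).ratio()

# Hypothetical sample data standing in for the two data frame columns
column_a = ['José Luiz da Silva', 'Maria Souza']
column_b = ['José L. Silva', 'Pedro Alves']

THRESHOLD = 0.7  # assumed cutoff; tune it for your own data

# One ratio per row, then a True/False flag per row
ratios = [similarity(a, b) for a, b in zip(column_a, column_b)]
flags = [ratio >= THRESHOLD for ratio in ratios]

for a, b, ratio, flag in zip(column_a, column_b, ratios, flags):
    print(f'{a!r} vs {b!r}: ratio={ratio:.2f} similar={flag}')
```

The first pair is flagged as similar and the second is not. The right threshold depends on how noisy the names in your data are.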

