I have a data frame in which I need to measure how similar the contents of two columns are.
For example: column a = “José Luiz da Silva” and column b = “José L. Silva”. How can I indicate that column a and column b are similar?
Here’s a possible solution, in Python:
from unidecode import unidecode  # third-party: pip install unidecode

ignore_list = ['de', 'do', 'da', 'dos', 'das']

def parse_name(full_name):
    name_list = full_name.split()  # Split the full name into words
    new_name_list = []
    for name in name_list:  # Go through each word
        name = name.strip('.')  # Remove periods
        name = name.lower()  # Convert all letters to lowercase
        if not name or name in ignore_list:  # Skip empty tokens and prepositions
            continue
        name = unidecode(name)  # Remove accents (requires the 'unidecode' library)
        new_name_list.append(name)
    return new_name_list

def is_similar(a, b):
    a = parse_name(a)
    b = parse_name(b)
    if len(a) != len(b):  # If the number of words differs, return False
        return False
    for x, y in zip(a, b):
        if (len(x) == 1) or (len(y) == 1):  # If either word has only one letter...
            if x[0] != y[0]:  # ...compare only the first letter
                return False
        else:  # Otherwise...
            if x != y:  # ...compare the whole word
                return False
    return True  # If every word matches, return True
Example of use:
a = 'José Luiz da Silva'
b = 'José L. Silva'
print(is_similar(a, b))  # Prints True
In this solution, the function is_similar() returns only True or False. Depending on your needs, it may be worth using a more flexible metric that returns a distance measure. For example:
'José L. Silva' and 'José Luiz da Silva' would have distance 0 (would be considered equal);
'José Silva' and 'José Luiz da Silva' would have a small distance value (would be considered similar);
'José Silva' and 'Maria Souza' would have a large distance value (would be considered quite different).
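One way such a distance could look is sketched below. This is only an illustration under assumptions of my own, not part of the solution above: the names normalize(), tokens_match() and name_distance() are hypothetical, accents are stripped with the standard unicodedata module instead of unidecode, and the score (1 minus the share of words matched in order) is just one possible choice.

```python
import unicodedata

IGNORED = {"de", "do", "da", "dos", "das"}  # same prepositions as ignore_list above

def normalize(full_name):
    """Lowercase, drop periods, accents and prepositions; return a list of words."""
    tokens = []
    for word in full_name.split():
        word = word.strip(".").lower()
        # Decompose accented letters and drop the combining marks (stdlib only)
        decomposed = unicodedata.normalize("NFKD", word)
        word = "".join(c for c in decomposed if not unicodedata.combining(c))
        if word and word not in IGNORED:
            tokens.append(word)
    return tokens

def tokens_match(x, y):
    """Initials match any word starting with the same letter; otherwise compare whole words."""
    if len(x) == 1 or len(y) == 1:
        return x[0] == y[0]
    return x == y

def name_distance(a, b):
    """Return a value in [0, 1]: 0 means all words match, 1 means none do (an assumed metric)."""
    ta, tb = normalize(a), normalize(b)
    if not ta and not tb:
        return 0.0
    # Longest common subsequence of words, using tokens_match as the equality test
    prev = [0] * (len(tb) + 1)
    for x in ta:
        curr = [0]
        for j, y in enumerate(tb, start=1):
            if tokens_match(x, y):
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    matched = prev[-1]
    return 1.0 - 2.0 * matched / (len(ta) + len(tb))

print(name_distance("José L. Silva", "José Luiz da Silva"))
print(name_distance("José Silva", "José Luiz da Silva"))
print(name_distance("José Silva", "Maria Souza"))
```

With this sketch the first pair scores 0, the second a small value, and the third the maximum of 1, which matches the three cases listed above. The cutoff that separates "similar" from "different" is left to the caller.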
(TL;DR)
# Testing the similarity ratio
from difflib import SequenceMatcher

def sml(x, y):
    return SequenceMatcher(None, x, y).ratio()

x = 'José Luiz da Silva'
y = 'José L. Silva'
msg = "Similarity ratio"
print(msg, 'between x and y:', sml(x, y))
print(msg, 'between x and x:', sml(x, x))
Output:
Similarity ratio between x and y: 0.7741935483870968
Similarity ratio between x and x: 1.0
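Since the question is about a data frame, here is a rough sketch of how the same ratio might be applied row by row with pandas. The column names "a" and "b", the sample rows, and the 0.7 cutoff are assumptions made for illustration; pick a threshold that suits your data.

```python
from difflib import SequenceMatcher

import pandas as pd

def sml(x, y):
    # Ratio in [0, 1]; 1.0 means the two strings are identical
    return SequenceMatcher(None, x, y).ratio()

# Hypothetical data: the column names "a" and "b" are assumed
df = pd.DataFrame({
    "a": ["José Luiz da Silva", "José Silva"],
    "b": ["José L. Silva", "Maria Souza"],
})

# Compute the ratio for each pair of values in the two columns
df["similarity"] = [sml(x, y) for x, y in zip(df["a"], df["b"])]

THRESHOLD = 0.7  # arbitrary cutoff, chosen only for this example
df["similar"] = df["similarity"] >= THRESHOLD

print(df)
```

The list comprehension over zip() keeps the example simple; for large frames the comparison is still done one row at a time, so a dedicated string-matching library may be faster.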