1st Option - Fuzzy Lookup Add-In for Excel
You can use the Fuzzy Lookup Add-In for Excel or create Fuzzy logic itself in a program. Search for Fuzzy Search or in English Fuzzy search/lookup.
Fuzzy Lookup Add-in for Excel was developed by Microsoft Research and a fuzzy correspondence of text data in Microsoft Excel. It can be used to identify diffuse duplicate lines within a single table or for confusing junctions in similar rows between two different tables. The correspondence is robust for a wide variety of errors, including misspellings, abbreviations, synonyms, and added/missing data. For example, it can detect that the lines "Mr. Andrew Hill", "Hill, Andrew R." and "Andy Hill" refer to the same underlying entity, returning to the similarity score with each match. While the default setting works well for a wide variety of textual data, such as product names or customer addresses, correspondence can also be customized to specific domains or languages. (Google Translation Translator.)
2nd Option - VBA Code
This code has been tested in Excel VBA and the credits are on this page.
Public Function Similarity(ByVal String1 As String, _
ByVal String2 As String, _
Optional ByRef RetMatch As String, _
Optional min_match = 1) As Single
Dim b1() As Byte, b2() As Byte
Dim lngLen1 As Long, lngLen2 As Long
Dim lngResult As Long
If UCase(String1) = UCase(String2) Then
Similarity = 1
Else:
lngLen1 = Len(String1)
lngLen2 = Len(String2)
If (lngLen1 = 0) Or (lngLen2 = 0) Then
Similarity = 0
Else:
b1() = StrConv(UCase(String1), vbFromUnicode)
b2() = StrConv(UCase(String2), vbFromUnicode)
lngResult = Similarity_sub(0, lngLen1 - 1, _
0, lngLen2 - 1, _
b1, b2, _
String1, _
RetMatch, _
min_match)
Erase b1
Erase b2
If lngLen1 >= lngLen2 Then
Similarity = lngResult / lngLen1
Else
Similarity = lngResult / lngLen2
End If
End If
End If
End Function
Private Function Similarity_sub(ByVal start1 As Long, ByVal end1 As Long, _
ByVal start2 As Long, ByVal end2 As Long, _
ByRef b1() As Byte, ByRef b2() As Byte, _
ByVal FirstString As String, _
ByRef RetMatch As String, _
ByVal min_match As Long, _
Optional recur_level As Integer = 0) As Long
'* CALLED BY: Similarity *(RECURSIVE)
Dim lngCurr1 As Long, lngCurr2 As Long
Dim lngMatchAt1 As Long, lngMatchAt2 As Long
Dim I As Long
Dim lngLongestMatch As Long, lngLocalLongestMatch As Long
Dim strRetMatch1 As String, strRetMatch2 As String
If (start1 > end1) Or (start1 < 0) Or (end1 - start1 + 1 < min_match) _
Or (start2 > end2) Or (start2 < 0) Or (end2 - start2 + 1 < min_match) Then
Exit Function '(exit if start/end is out of string, or length is too short)
End If
For lngCurr1 = start1 To end1
For lngCurr2 = start2 To end2
I = 0
Do Until b1(lngCurr1 + I) <> b2(lngCurr2 + I)
I = I + 1
If I > lngLongestMatch Then
lngMatchAt1 = lngCurr1
lngMatchAt2 = lngCurr2
lngLongestMatch = I
End If
If (lngCurr1 + I) > end1 Or (lngCurr2 + I) > end2 Then Exit Do
Loop
Next lngCurr2
Next lngCurr1
If lngLongestMatch < min_match Then Exit Function
lngLocalLongestMatch = lngLongestMatch
RetMatch = ""
lngLongestMatch = lngLongestMatch _
+ Similarity_sub(start1, lngMatchAt1 - 1, _
start2, lngMatchAt2 - 1, _
b1, b2, _
FirstString, _
strRetMatch1, _
min_match, _
recur_level + 1)
If strRetMatch1 <> "" Then
RetMatch = RetMatch & strRetMatch1 & "*"
Else
RetMatch = RetMatch & IIf(recur_level = 0 _
And lngLocalLongestMatch > 0 _
And (lngMatchAt1 > 1 Or lngMatchAt2 > 1) _
, "*", "")
End If
RetMatch = RetMatch & Mid$(FirstString, lngMatchAt1 + 1, lngLocalLongestMatch)
lngLongestMatch = lngLongestMatch _
+ Similarity_sub(lngMatchAt1 + lngLocalLongestMatch, end1, _
lngMatchAt2 + lngLocalLongestMatch, end2, _
b1, b2, _
FirstString, _
strRetMatch2, _
min_match, _
recur_level + 1)
If strRetMatch2 <> "" Then
RetMatch = RetMatch & "*" & strRetMatch2
Else
RetMatch = RetMatch & IIf(recur_level = 0 _
And lngLocalLongestMatch > 0 _
And ((lngMatchAt1 + lngLocalLongestMatch < end1) _
Or (lngMatchAt2 + lngLocalLongestMatch < end2)) _
, "*", "")
End If
Similarity_sub = lngLongestMatch
End Function
That a simple test is seen in the following image:
Which means they both have a similarity of 94.1176%.
This function (Similarity
) can be used to compare a database and from a minimum setting point duplicates can be found by similarity in percentage. This analysis can be time-consuming for a large base, requiring good quality programming with emphasis on performance if you want a better processing time.
3rd Option - Levenshtein Distance
Credits: sysmod
Private Declare Function GetTickCount Lib "kernel32.dll" () As Long
Option Explicit
'Option Base 0 assumed
Sub testLevenshtein()
Dim s1 As String, s2 As String, lTime As Long, i As Long, teste As Long
s1 = "Luiz Carlos Silva"
s2 = "Luis Carlos Silva"
lTime = GetTickCount()
teste = LevenshteinB(s1, s2)
Debug.Print GetTickCount - lTime; " ms" ' 234 ms
Debug.Print teste
End Sub
'POB: fn with byte array and inline MIN code is 17 times faster
Function LevenshteinB(ByVal string1 As String, ByVal string2 As String) As Long
'http://www.sysmod.com/modLevenshtein.bas
Dim i As Long, j As Long, ByteArray1() As Byte, ByteArray2() As Byte
Dim string1_length As Long
Dim string2_length As Long
Dim distance() As Long
Dim min1 As Long, min2 As Long, min3 As Long
Const UseWSMIN = False
string1_length = Len(string1)
string2_length = Len(string2)
ReDim distance(string1_length, string2_length)
ByteArray1 = string1
ByteArray2 = string2
For i = 0 To string1_length
distance(i, 0) = i
Next
For j = 0 To string2_length
distance(0, j) = j
Next
For i = 1 To string1_length
For j = 1 To string2_length
'Unicode, compare both even and odd bytes
If ByteArray1((i - 1) * 2) = ByteArray2((j - 1) * 2) And _
ByteArray1((i - 1) * 2 + 1) = ByteArray2((j - 1) * 2 + 1) Then
distance(i, j) = distance(i - 1, j - 1)
Else
If UseWSMIN Then
distance(i, j) = WorksheetFunction.Min _
(distance(i - 1, j) + 1, _
distance(i, j - 1) + 1, _
distance(i - 1, j - 1) + 1)
Else
' spell it out, 50 times faster than worksheetfunction.min
min1 = distance(i - 1, j) + 1
min2 = distance(i, j - 1) + 1
min3 = distance(i - 1, j - 1) + 1
If min1 <= min2 And min1 <= min3 Then
distance(i, j) = min1
ElseIf min2 <= min1 And min2 <= min3 Then
distance(i, j) = min2
Else
distance(i, j) = min3
End If
End If
End If
Next
Next
LevenshteinB = distance(string1_length, string2_length)
End Function
Which returns the image result:
In this test the information that appears is of how many characters are different between the two Strings and the time for a comparison that is less than 1 ms, for 100 iterations took 16 ms.
EDIT:
An example code to work with the similarity function:
Sub testSimilaridade()
Dim LastRow As Long, i As Long
Dim Arr() As Variant, NewArr() As Variant
Dim Names As Worksheet, ws As Worksheet
Dim Similaridade As Single, Limite As Single
Set Names = ThisWorkbook.Worksheets("Names")
SheetKiller ("NewNames")
Set ws = Sheets.Add
ws.Name = "NewNames"
LastRow = Names.Cells(Names.Rows.Count, "A").End(xlUp).Row
Arr = Names.Range("a2", Names.Cells(LastRow, 1))
NewArr = Arr
Limite = 0.9
For i = LBound(Arr) To UBound(Arr)
If Not i = UBound(Arr) Then x = i + 1
For k = x To UBound(Arr)
Similaridade = Similarity(CStr(Arr(i, 1)), CStr(Arr(k, 1)))
If Similaridade > Limite Then
NewArr(k, 1) = ""
End If
Next k
Next i
For i = LBound(Arr) To UBound(Arr)
ws.Cells(i, 1) = NewArr(i, 1)
Next i
ws.Range("A:A").Sort Key1:=ws.Range("A2"), Order1:=xlAscending, Header:=xlYes
End Sub
Public Function SheetKiller(Name As String)
Dim s As Worksheet, t As String
Dim i As Long, k As Long
k = Sheets.Count
For i = k To 1 Step -1
t = Sheets(i).Name
If t = Name Then
Application.DisplayAlerts = False
Sheets(i).Delete
Application.DisplayAlerts = True
End If
Next i
End Function
Results:
With this example data:
The result is as follows:
Another Test of Result
A test can be done with the database "nickname-and-diminutive-Names-lookup", that with 2266 entries, the output was 1509 records. Performing in less than 60 seconds.
To parse the data, make a copy and paste of the raw data in cell A2. And then run the following code:
Sub test()
Dim ws As Worksheet, source As Worksheet
Dim LastRowA As Long, LastRowB As Long, i As Long, k As Long
Dim strCell As String
SheetKiller ("Names")
Set ws = Sheets.Add
ws.Name = "Names"
Set source = ThisWorkbook.Worksheets("Planilha1")
LastRowA = source.Cells(source.Rows.Count, "A").End(xlUp).Row
For i = 1 To LastRowA
strCell = CStr(source.Cells(i, 1))
Count = Len(strCell) - Len(Replace(strCell, ",", ""))
For k = 1 To Count
LastRowB = ws.Cells(ws.Rows.Count, "A").End(xlUp).Row
ws.Cells(LastRowB + 1, 1) = EXTRACTELEMENT(strCell, k, ",")
Next k
Next i
End Sub
Function EXTRACTELEMENT(Txt As String, n, Separator As String) As String
On Error GoTo ErrHandler:
EXTRACTELEMENT = Split(Application.Trim(Mid(Txt, 1)), Separator)(n - 1)
Exit Function
ErrHandler:
' error handling code
EXTRACTELEMENT = CVErr(xlErrNA)
On Error GoTo 0
End Function
Do you have any database with the names considered correct?
– danieltakeshi
Good afternoon! not have, this base is extracted from a system of care, operators type the data of the attended, but we work with street population or with high vulnerabilities, so it is not possible to require any documents, which has led to a gigantic number of duplicities.
– Bruno S.