How to verify similarity between strings?

Asked

Viewed 20,688 times

24

It’s very common to compare strings, which is usually done by comparing equality. However, today I have arisen the need to compare the likeness between two strings, so I can compare how similar they are.

For example:

"City" is different from "city", but is similar (consider c != C).

"Cdade" is different from "city", but is similar (assuming user typed wrong).

"ciddade" is different from "city", but is similar (the same thing).

My question is: how do I verify the similarity between strings? Is there an algorithm ready to do that?

I’m looking for a way to do that:

if(checkSimilarity("cidade","cdade") > 0.8)
{
    // as duas strings são muito parecidas
    // sendo que checkSimilarity() retornando 1 para strings iguais
}

No matter the programming language used in the answer. My question is more related to the algorithm.

  • My doubt is not about phonetics. Only the spelling and not necessarily for Portuguese.

  • 1

    You just want to compare if characters are missing (only)? wing = is approximately 25% smaller than home?

  • 3

    Something like that Distance Levenshtein?

  • Basically. I’m looking for a way to compare strings that may have been typed wrong by the user and I can compare the similarity to an expected value.

  • Levenshtein distance may work (maybe even more than necessary). I need to test. What I need to do is just check things like words with extra letters in the middle or missing. But, I can’t predict where these "wrong" characters will be.

Show 1 more comment

5 answers

12


I believe that the levenshtein’s algorithm be a solution that meets well what you are intending to do, since with it it is possible to know the number of modifications through which one word needs to pass until it matches the other.

A possible implementation of it (in C++) would be:

#include <iostream.h>
#include <string.h>
#include <math.h>
#include <algorithm>

using namespace std;

int VerificarSimilaridade(string palavra1, string palavra2)
{
    int tam1 = palavra1.size();
    int tam2 = palavra2.size();
    int verif[tam1 + 1][tam2 + 1];

    // Se uma das palavras tiver coprimento igual a zero, é necessária uma modificação do tamanho da outra:
    if (tam1 == 0)
        return tam2;

    if (tam2 == 0)
        return tam1;

    // Atribuir ordem numérica resultante das palavras ao vetor de modo, respectivamente, "vertical" e "horizontal":
    int i = 0;
    while(i <= tam1)
        verif[i][0] = i++;

    int j = 0; 
    while(j <= tam2)
        verif[0][j] = j++;

    // Verificação:
    for (int i = 1; i <= tam1; i++)
    {
        for (int j = 1; j <= tam2; j++)
        {
            // Definindo custos de modificação (deleção, inserção e substituição):
            int custo = (palavra2[j - 1] == palavra1[i - 1]) ? 0 : 1;

            verif[i][j] = min(
                min(verif[i - 1][j] + 1, verif[i][j - 1] + 1), 
                verif[i - 1][j - 1] + custo);
        }
    }

    return verif[tam1][tam2];
 }

 int main()
 {
     string pala1, pala2;

     cout << "Informe a primeira palavra: " << endl;
     cin >> pala1;
     cout << "Informe a segunda palavra: " << endl;
     cin >> pala2;

     cout << "O numero de modificacoes necessarias para que as duas palavras se igualem e: " 
          << VerificarSimilaridade(pala1, pala2) << endl;

     system("pause");
     return 0;   
 }

Moreover, from the number of modifications you can create your own criterion to quantify the level of similarity between two strings; something interesting would be to calculate the percentage of change according to the length of the words, so through a reference degree it would be possible to verify that 5 modifications can have high value in a string with 7 letters and at the same time, tiny in a with 200 for example.

  • 1

    +1. I also made an implementation in Javascript: https://github.com/CsRenan/Programming-Doodles/blob/master/levenshtein.js

  • There are some very similar algorithms also that check the similarity of the strings and return a floating point determining whether the similarity is good or not. Follow the link: https://github.com/tdebatty/java-string-similarity

9

There is an article(How to Strike a Match) created by Simon White related to what you want, he wrote an article about an algorithm that compares adjacent pairs of characters that should be useful to you.

Some advantages over other algorithms (such as Soundex, Levenshtein, among others, olhe aqui) are they:

  1. A true reflection of the lexical similarity - strings with small differences should be recognized as similar.
  2. Robustness to word order changes - two strings that contain the same words, but in a different order, must be recognized as being similar. On the other hand, if one string is just a random anagram of the characters contained in the other, then it should (usually) be recognised as different.
  3. Language independence - the algorithm must work not only in English, but in many different languages.

For example, France must be similar to Français and República da França and República da França should be similar to both República Francesa and Republique Francaise. (Free translation)

Code below developed in C#.

/// <summary>
/// This class implements string comparison algorithm
/// based on character pair similarity
/// Source: http://www.catalysoft.com/articles/StrikeAMatch.html
/// </summary>
public class SimilarityTool
{
    /// <summary>
    /// Compares the two strings based on letter pair matches
    /// </summary>
    /// <param name="str1"></param>
    /// <param name="str2"></param>
    /// <returns>The percentage match from 0.0 to 1.0 where 1.0 is 100%</returns>
    public double CompareStrings(string str1, string str2)
    {
        List<string> pairs1 = WordLetterPairs(str1.ToUpper());
        List<string> pairs2 = WordLetterPairs(str2.ToUpper());

        int intersection = 0;
        int union = pairs1.Count + pairs2.Count;

        for (int i = 0; i < pairs1.Count; i++)
        {
            for (int j = 0; j < pairs2.Count; j++)
            {
                if (pairs1[i] == pairs2[j])
                {
                    intersection++;
                    pairs2.RemoveAt(j);//Must remove the match to prevent "GGGG" from appearing to match "GG" with 100% success

                    break;
                }
            }
        }

        return (2.0 * intersection) / union;
    }

    /// <summary>
    /// Gets all letter pairs for each
    /// individual word in the string
    /// </summary>
    /// <param name="str"></param>
    /// <returns></returns>
    private List<string> WordLetterPairs(string str)
    {
        List<string> AllPairs = new List<string>();

        // Tokenize the string and put the tokens/words into an array
        string[] Words = Regex.Split(str, @"\s");

        // For each word
        for (int w = 0; w < Words.Length; w++)
        {
            if (!string.IsNullOrEmpty(Words[w]))
            {
                // Find the pairs of characters
                String[] PairsInWord = LetterPairs(Words[w]);

                for (int p = 0; p < PairsInWord.Length; p++)
                {
                    AllPairs.Add(PairsInWord[p]);
                }
            }
        }

        return AllPairs;
    }

    /// <summary>
    /// Generates an array containing every 
    /// two consecutive letters in the input string
    /// </summary>
    /// <param name="str"></param>
    /// <returns></returns>
    private string[] LetterPairs(string str)
    {
        int numPairs = str.Length - 1;

        string[] pairs = new string[numPairs];

        for (int i = 0; i < numPairs; i++)
        {
            pairs[i] = str.Substring(i, 2);
        }

        return pairs;
    }
}

7

Here is a fairly simple solution proposal (I made in Java):

public class Teste {

    protected static float checkSimilarity(String sString1, String sString2) throws Exception {

        // Se as strings têm tamanho distinto, obtêm a similaridade de todas as
        // combinações em que tantos caracteres quanto a diferença entre elas são
        // inseridos na string de menor tamanho. Retorna a similaridade máxima
        // entre todas as combinações, descontando um percentual que representa
        // a diferença em número de caracteres.
        if(sString1.length() != sString2.length()) {
            int iDiff = Math.abs(sString1.length() - sString2.length());
            int iLen = Math.max(sString1.length(), sString2.length());
            String sBigger, sSmaller, sAux;

            if(iLen == sString1.length()) {
                sBigger = sString1;
                sSmaller = sString2;
            }
            else {
                sBigger = sString2;
                sSmaller = sString1;
            }

            float fSim, fMaxSimilarity = Float.MIN_VALUE;
            for(int i = 0; i <= sSmaller.length(); i++) {
                sAux = sSmaller.substring(0, i) + sBigger.substring(i, i+iDiff) + sSmaller.substring(i);
                fSim = checkSimilaritySameSize(sBigger,  sAux);
                if(fSim > fMaxSimilarity)
                    fMaxSimilarity = fSim;
            }
            return fMaxSimilarity - (1f * iDiff) / iLen;

        // Se as strings têm o mesmo tamanho, simplesmente compara-as caractere
        // a caractere. A similaridade advém das diferenças em cada posição.
        } else
            return checkSimilaritySameSize(sString1, sString2);
    }

    protected static float checkSimilaritySameSize(String sString1, String sString2) throws Exception {

        if(sString1.length() != sString2.length())
            throw new Exception("Strings devem ter o mesmo tamanho!");

        int iLen = sString1.length();
        int iDiffs = 0;

        // Conta as diferenças entre as strings
        for(int i = 0; i < iLen; i++)
            if(sString1.charAt(i) != sString2.charAt(i))
                iDiffs++;

        // Calcula um percentual entre 0 e 1, sendo 0 completamente diferente e
        // 1 completamente igual
        return 1f - (float) iDiffs / iLen;
    }

    public static void main(String[] args) {
        try {
            System.out.println("'ABCD' vs 'ab' = " + checkSimilarity("ABCD", "ab"));
            System.out.println("'cidade' vs 'cdade' = " + checkSimilarity("cidade", "cdade"));
            System.out.println("'cidade' vs 'ciDade' = " + checkSimilarity("cidade", "ciDade"));
            System.out.println("'cidade' vs 'cdiade' = " + checkSimilarity("cidade", "cdiade"));
            System.out.println("'cidade' vs 'edadic' = " + checkSimilarity("cidade", "edadic"));
            System.out.println("'cidade' vs 'CIDADE' = " + checkSimilarity("cidade", "CIDADE"));
            System.out.println("'cidade' vs 'CIdADE' = " + checkSimilarity("cidade", "CIdADE"));
            System.out.println("'cidade' vs 'CdADE' = " + checkSimilarity("cidade", "CdADE"));
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

The principle of the algorithm is quite simple:

  • If the strings are exactly the same size, the method checkSimilarity simply invoke the base method checkSimilaritySameSize, which compares the strings to the character (left to right), counts the number of errors (different characters) and returns a percentage of errors relative to the size of the strings.
  • If, on the other hand, strings have different sizes, the method checkSimilarity makes the same test for every possible combination where as many characters as the difference are included from the larger string at all positions of the smaller string. For example, assuming string 1 as "ABCD" (size = 4) and string 2 as "ab" (size = 2), the combinations will be "Abab", "aBCb" and "abCD".
  • Among the similarities calculated between the combinations and the string of larger size (since both now have the same size), the method chooses the maximum value, but discounts an "error rate" proportional to the number of characters in the difference.

Thus, the result of the execution of the given code is the following for these examples:

'ABCD' vs 'ab' = 0.0
'cidade' vs 'cdade' = 0.8333333
'cidade' vs 'ciDade' = 0.8333333
'cidade' vs 'cdiade' = 0.6666666
'cidade' vs 'edadic' = 0.0
'cidade' vs 'CIDADE' = 0.0
'cidade' vs 'CIdADE' = 0.16666669
'cidade' vs 'CdADE' = 0.16666664

The closer to 1, the more similar the strings are, and the closer to 0, the more distinct they are. If you don’t want to differentiate maísculas from lowercase (note example in the third line), just convert both strings to maísculo (with String::toUpperCase()) before comparing them.

Note that this solution is quite simple because it admits that errors (if any) are due to the lack of one or more characters in sequence. You can improve the algorithm so that it considers all the actual possible combinations of characters, but probably this will be like using a bazooka to kill an ant.

2

The most commonly used algorithm for this is the famous "soundex". It consists basically of a code with a letter and 3 numbers. There are variations of this algorithm, according to the phonetics of each country. Tahiti, for example, which has only 8 consonants, will have greater difficulty comparing words.

The following xyyy representation, common in the implementation of the English language, (one consonant and three vowels) would thus be:

First choose a table to use, or create your own, based on language. Example:

  • 1 B F P V
  • 2 C G J K
  • 2 Q S X Y
  • 3 D T
  • 4 L LH
  • 5 m n
  • 6 R RR

Another example table (I use a variation of this):

  • 1 T D
  • 2 N NH
  • 3 M
  • 4 R RR RH
  • 5 L LH
  • 6 G J
  • 7 C K
  • 8 F W V
  • 9 P B
  • 0 S SS SH

Now, to add up, use the principles below:

  • every constant will have a corresponding number;
  • every vowel and punctuation will be ignored;
  • the first letter found will be represented by x;
  • after filling the 3 consonants, the additional consonants will be ignored;
  • duplicated consonants will be unified;
  • short names will have zeros added until they complete the xyyy representation.

Once done, just add up the values.

For example, American, it will have the following value M-472. Why? Vowels are ignored: mrcns First letter: M Next consonants and their values: 4(R)7(C)2(N)

If the person types Americcans, it will have the following value M-472: Ignored vowels: mrccns First letter: M Proxima: 4(R)77(C, two consonants are transformed into 1)2(N).

That is, the sound of Americans and Americcans, America, Americanized, are identical. Now, just create your own variation, as your need.

  • 1

    In the same family of phonetic algorithms worth mentioning the various implementations of Metaphone, algorithm typically used in spell checkers.

  • Good tip. I didn’t know Metaphone

0

In java you have the contains() method of the String class makes the comparison to see if one string is present in the other:

String a = "Teste123";
String b = "123";

if(a.contains(b) == true){
System.out.println("Encontrado!");
}

Another alternative would be to take all the letters of the strings and compare the letters separately:

String a = "Cidade";
String b = "idade";
int auxiliar;

for(int i = 0;i<a.lenght();i++){
if(existeLetra(a.substring(i),b)){
   auxiliar++;
}
}

if(a.lenght - auxiliar < 2){
System.out.println("Parecido!");
}

public boolean existeLetra(char letra,String palavra){
return palavra.contains(letra); // retorna true or false
}

Just to compare difference in uppercase and minuscule letters you can use the ignore method().

String a = "Cidade";
String b = "cidade";

if(b.ignoreCase().equals(a.ignoreCase)){
   System.out.println("OK!");
}
  • your name is Paganini for real or is just a nick?

  • It’s actually Paganini, because?

  • 1

    For nothing, it reminded me the composer Nicolo Paganini, cool..

Browser other questions tagged

You are not signed in. Login or sign up in order to post.