Here is a fairly simple solution proposal (I made in Java):
public class Teste {
protected static float checkSimilarity(String sString1, String sString2) throws Exception {
// Se as strings têm tamanho distinto, obtêm a similaridade de todas as
// combinações em que tantos caracteres quanto a diferença entre elas são
// inseridos na string de menor tamanho. Retorna a similaridade máxima
// entre todas as combinações, descontando um percentual que representa
// a diferença em número de caracteres.
if(sString1.length() != sString2.length()) {
int iDiff = Math.abs(sString1.length() - sString2.length());
int iLen = Math.max(sString1.length(), sString2.length());
String sBigger, sSmaller, sAux;
if(iLen == sString1.length()) {
sBigger = sString1;
sSmaller = sString2;
}
else {
sBigger = sString2;
sSmaller = sString1;
}
float fSim, fMaxSimilarity = Float.MIN_VALUE;
for(int i = 0; i <= sSmaller.length(); i++) {
sAux = sSmaller.substring(0, i) + sBigger.substring(i, i+iDiff) + sSmaller.substring(i);
fSim = checkSimilaritySameSize(sBigger, sAux);
if(fSim > fMaxSimilarity)
fMaxSimilarity = fSim;
}
return fMaxSimilarity - (1f * iDiff) / iLen;
// Se as strings têm o mesmo tamanho, simplesmente compara-as caractere
// a caractere. A similaridade advém das diferenças em cada posição.
} else
return checkSimilaritySameSize(sString1, sString2);
}
protected static float checkSimilaritySameSize(String sString1, String sString2) throws Exception {
if(sString1.length() != sString2.length())
throw new Exception("Strings devem ter o mesmo tamanho!");
int iLen = sString1.length();
int iDiffs = 0;
// Conta as diferenças entre as strings
for(int i = 0; i < iLen; i++)
if(sString1.charAt(i) != sString2.charAt(i))
iDiffs++;
// Calcula um percentual entre 0 e 1, sendo 0 completamente diferente e
// 1 completamente igual
return 1f - (float) iDiffs / iLen;
}
public static void main(String[] args) {
try {
System.out.println("'ABCD' vs 'ab' = " + checkSimilarity("ABCD", "ab"));
System.out.println("'cidade' vs 'cdade' = " + checkSimilarity("cidade", "cdade"));
System.out.println("'cidade' vs 'ciDade' = " + checkSimilarity("cidade", "ciDade"));
System.out.println("'cidade' vs 'cdiade' = " + checkSimilarity("cidade", "cdiade"));
System.out.println("'cidade' vs 'edadic' = " + checkSimilarity("cidade", "edadic"));
System.out.println("'cidade' vs 'CIDADE' = " + checkSimilarity("cidade", "CIDADE"));
System.out.println("'cidade' vs 'CIdADE' = " + checkSimilarity("cidade", "CIdADE"));
System.out.println("'cidade' vs 'CdADE' = " + checkSimilarity("cidade", "CdADE"));
} catch (Exception e) {
e.printStackTrace();
}
}
}
The principle of the algorithm is quite simple:
- If the strings are exactly the same size, the method
checkSimilarity
simply invoke the base method checkSimilaritySameSize
, which compares the strings to the character (left to right), counts the number of errors (different characters) and returns a percentage of errors relative to the size of the strings.
- If, on the other hand, strings have different sizes, the method
checkSimilarity
makes the same test for every possible combination where as many characters as the difference are included from the larger string at all positions of the smaller string. For example, assuming string 1 as "ABCD" (size = 4) and string 2 as "ab" (size = 2), the combinations will be "Abab", "aBCb" and "abCD".
- Among the similarities calculated between the combinations and the string of larger size (since both now have the same size), the method chooses the maximum value, but discounts an "error rate" proportional to the number of characters in the difference.
Thus, the result of the execution of the given code is the following for these examples:
'ABCD' vs 'ab' = 0.0
'cidade' vs 'cdade' = 0.8333333
'cidade' vs 'ciDade' = 0.8333333
'cidade' vs 'cdiade' = 0.6666666
'cidade' vs 'edadic' = 0.0
'cidade' vs 'CIDADE' = 0.0
'cidade' vs 'CIdADE' = 0.16666669
'cidade' vs 'CdADE' = 0.16666664
The closer to 1, the more similar the strings are, and the closer to 0, the more distinct they are. If you don’t want to differentiate maísculas from lowercase (note example in the third line), just convert both strings to maísculo (with String::toUpperCase()
) before comparing them.
Note that this solution is quite simple because it admits that errors (if any) are due to the lack of one or more characters in sequence. You can improve the algorithm so that it considers all the actual possible combinations of characters, but probably this will be like using a bazooka to kill an ant.
possible duplicate of How to make a phonetic algorithm for Brazilian Portuguese?
– Leandro Amorim
My doubt is not about phonetics. Only the spelling and not necessarily for Portuguese.
– Lucas Lima
You just want to compare if characters are missing (only)? wing = is approximately 25% smaller than home?
– Leandro Amorim
Something like that Distance Levenshtein?
– Leandro Amorim
Basically. I’m looking for a way to compare strings that may have been typed wrong by the user and I can compare the similarity to an expected value.
– Lucas Lima
Levenshtein distance may work (maybe even more than necessary). I need to test. What I need to do is just check things like words with extra letters in the middle or missing. But, I can’t predict where these "wrong" characters will be.
– Lucas Lima