Search for word variations

2

I have a sentence that I need to check against a rule, but it may be written in different ways (accents, extra or missing spaces, ...).

Example:

string fraseProcurada = "Cadastro de Usuários - SP";

if (fraseRecebida.Contains(fraseProcurada))
{
    // received the phrase I was looking for
}

However, in this example the user may type any of the following:

  • Register of users
  • user register
  • user register - são paulo
  • SP users register
  • user registration - S.P.

That is, a number of variations that all match what I’m looking for. My first thought was to build an array with these possible forms, but I don’t know whether there is a more reliable and easier way to do it (like Regex).

Any suggestions?

Thank you.

  • Remove double spaces, put everything in lowercase, remove accents, and only then do the check. Wouldn’t that work?

  • David, yes, but that’s the simplest part... what I asked is whether there is a more practical way (e.g. Regex) than creating an array of possibilities.

  • Without a mechanical search (using an array of possibilities for each element), the road is long: you would need to build a search engine like Google’s. The variations are numerous: "users register", "users' database", "são paulo - Cad. user", "users' register" and so on...

  • I believe the way forward is to determine the limits of the variations (whether or not they can be expressed in Regex, for example) and seek a more restricted solution.

3 answers

2


Some comparisons are not possible, such as matching "S.P" or even "São Paulo", because no generic code such as Contains or IndexOf can tell, for example, that "SP" means the same as "São Paulo".
In such cases, you would need to create a dictionary of equivalences to help.
For cases where the difference comes down to accents, uppercase and lowercase, combining the CompareOptions flags with IndexOf already solves many cases. It can be done like this:

using System.Globalization;

// Returns true if textoAComparar occurs in texto, ignoring case, symbols and accents.
static bool Comparar(string texto, string textoAComparar)
{
    var index = CultureInfo.InvariantCulture.CompareInfo.IndexOf(
        texto, textoAComparar,
        CompareOptions.IgnoreCase | CompareOptions.IgnoreSymbols | CompareOptions.IgnoreNonSpace);
    return index >= 0;
}

This will cover most cases. Here’s a working example: https://dotnetfiddle.net/S1Jscu
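The dictionary of equivalences mentioned above could be sketched roughly as follows. This is only an illustration: the class name Equivalencias, the methods Canonizar and Contem, and the three table entries are assumptions made up for this example, not part of the answer. Each known variant is rewritten to one canonical form before the same IndexOf comparison is applied.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text.RegularExpressions;

public static class Equivalencias
{
    // Hypothetical table (not from the answer): known variant -> canonical form.
    private static readonly Dictionary<string, string> Mapa =
        new Dictionary<string, string>
        {
            { "S.P.", "SP" },
            { "São Paulo", "SP" },
            { "Sao Paulo", "SP" },
        };

    // Replaces every known variant in the text with its canonical form.
    public static string Canonizar(string texto)
    {
        foreach (var par in Mapa)
        {
            texto = Regex.Replace(texto, Regex.Escape(par.Key), par.Value,
                RegexOptions.IgnoreCase | RegexOptions.CultureInvariant);
        }
        return texto;
    }

    // Canonizes both strings, then searches ignoring case, symbols and accents.
    public static bool Contem(string texto, string procurado)
    {
        var index = CultureInfo.InvariantCulture.CompareInfo.IndexOf(
            Canonizar(texto), Canonizar(procurado),
            CompareOptions.IgnoreCase | CompareOptions.IgnoreSymbols | CompareOptions.IgnoreNonSpace);
        return index >= 0;
    }
}
```

The table still has to be maintained by hand, so this only helps when the set of variants is small and known in advance.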

  • Ricardo, that’s what I figured: there’s no way around an array. I thought there might be a way with Regex, but since the phrases can vary so much, the code would get big anyway. Thanks :)

1

You can normalize the text so that every character is unaccented and lowercase. The System.Text namespace provides the normalization forms:

using System.Text;

string s1 = "Cadastro de Usuários - SP";
// Note: FormC only composes characters; stripping accents needs FormD plus removal of the combining marks.
string s2 = s1.Normalize(NormalizationForm.FormC).ToLowerInvariant();

  • But the question is related to C#

  • I’m sorry, I hadn’t noticed the tag. I’ll edit.

  • Yes, C#... I looked and saw that C# also has Normalize, but it didn’t work (nothing happened): string t = fraseRecebida.Normalize(NormalizationForm.FormD); t = t.replaceAll("\\p{M}", "").toLowerCase();

  • If I’m not mistaken, FormD only indicates whether the string is normalized, while FormC indicates it and replaces the characters if possible...

  • Marcos, note that the code already does this; the question is really about the possible variations. But thanks for the help :)

  • I see. Good luck, Andreia!


1

You can create a helper class that adds extension methods to the string type, one for each treatment you want, removing whatever characters you consider relevant. For example:

using System.Globalization;
using System.Text;
using System.Text.RegularExpressions;

public static class StringHelper
{
    // Decomposes accented letters (FormD) and drops the combining marks.
    public static string RemoverAcentos(this string texto)
    {
        StringBuilder retorno = new StringBuilder();
        var arrTexto =
            texto.Normalize(NormalizationForm.FormD).ToCharArray();

        foreach (var letra in arrTexto)
        {
            if (CharUnicodeInfo.GetUnicodeCategory(letra) !=
                UnicodeCategory.NonSpacingMark)
                retorno.Append(letra);
        }
        return retorno.ToString();
    }

    // Removes tabs and spaces.
    public static string RemoverEspacamentos(this string texto)
    {
        return texto.Replace("\t", "").Replace(" ", "");
    }

    // Strips accents, lowercases, and keeps only a-z, 0-9 and whitespace.
    public static string RemoverCaracteresEspeciais(this string texto)
    {
        string retorno = texto.RemoverAcentos();
        retorno = Regex.Replace(retorno.ToLowerInvariant(), @"[^a-z0-9\s]", "");
        return retorno;
    }
}

And use as follows:

string entrada = "São Paulo SP";
string entradaNormalizada = entrada.RemoverCaracteresEspeciais()
                            .RemoverEspacamentos()
                            .ToLower();

string cadastro = "Cidade de São Paulo - SP";
string cadastroNormalizado = cadastro.RemoverCaracteresEspeciais()
                            .RemoverEspacamentos()
                            .ToLower();

bool comparacao = cadastroNormalizado.Contains(entradaNormalizada); // true

Yet this is only the first part of your journey: even after these basic treatments, you will only get a match when the input is shorter than the base text it is compared with and its terms appear in the same order. If the input is, for example, "I live in the city of São Paulo" or "SP - São Paulo", the comparison will be false.

From this point on, you need to enrich your mechanism with a hit score: count how many terms of A occur in B and use that count to decide whether the comparison is valid.
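A hit score of this kind might look like the sketch below. The class name Pontuacao and the methods Termos and Calcular are hypothetical names chosen for this example. It splits both texts into unaccented, lowercase terms and returns the fraction of the terms of A that also occur in B, regardless of order.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;
using System.Text;

public static class Pontuacao
{
    // Strips accents, lowercases, and splits on anything that is not a letter or digit.
    public static string[] Termos(string texto)
    {
        var semAcentos = new string(texto
            .Normalize(NormalizationForm.FormD)
            .Where(c => CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
            .ToArray());

        var soTermos = new string(semAcentos.ToLowerInvariant()
            .Select(c => char.IsLetterOrDigit(c) ? c : ' ')
            .ToArray());

        return soTermos
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
            .Distinct()
            .ToArray();
    }

    // Fraction of the distinct terms of A that also occur in B (0.0 to 1.0).
    public static double Calcular(string a, string b)
    {
        var termosA = Termos(a);
        if (termosA.Length == 0) return 0.0;

        var termosB = new HashSet<string>(Termos(b));
        return termosA.Count(t => termosB.Contains(t)) / (double)termosA.Length;
    }
}
```

With this sketch, Pontuacao.Calcular("São Paulo SP", "SP - São Paulo") gives 1.0 even though the order differs. The threshold above which a score counts as a match is a choice you would have to tune for your own data.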

If you need something more sophisticated, you will have to adopt a search API that meets your needs, such as Lucene or Reddog.Search.

  • Leandro, thank you, but unfortunately I won’t be able to apply this option in the project. I need something really simple and, like you said, it’s just part of what I need to do.
