Search for word variations

Question

Asked 7 years, 4 months ago

Viewed 253 times

2

I have a sentence that I need to check against a rule, but the way it is written may vary (accents, extra or missing spaces, ...).

Example:

string fraseProcurada = "Cadastro de Usuários - SP";

if (fraseRecebida.Contains(fraseProcurada))
{
   // received the phrase I was looking for
}

However, in this example, the user may type any of the following:

Register of users
user register
user register - são paulo
SP users register
user registration - S.P.

A certain number of variations actually match what I'm looking for. I first thought of building an array with these possible forms, but I don't know whether there is a more reliable and easier way to do it (such as Regex).

Any suggestions?

Thank you.

Remove double spaces, convert everything to lowercase, remove accents, and only then do the check: wouldn't that work?

– David Dias

2018/04/04 at 13:56
David, yes, but that's the simplest part... What I asked is whether there is a much more practical way (e.g. Regex) than creating an array of possibilities.

– aa_sp

2018/04/04 at 14:01
Without a mechanical search (using an array of possibilities for each element), the road is long: you would need to build a search engine like Google's. The variations are numerous: "users register", "users' database", "são paulo - Cad. user", "users' register" and so on...

– Diego Rafael Souza

2018/04/04 at 14:10
I believe the way forward is to determine the limits of the variations (whether or not they include possibilities that can be expressed in Regex, for example) and to seek a more restricted solution.

– Diego Rafael Souza

2018/04/04 at 14:13

3 answers

2

Some comparisons are not possible with generic code: methods such as Contains or IndexOf cannot know, for example, that "S.P" or "São Paulo" means the same as "SP".
In such cases, you would need to create a dictionary that maps the equivalent forms to each other.
For cases where the difference comes down to accents and uppercase versus lowercase, combining the CompareOptions flags with IndexOf already solves many cases. It can be done like this:

// Requires: using System.Globalization;
static bool Comparar(string texto, string textoAComparar)
{
    // Culture-aware search that ignores case, symbols and accents (non-spacing marks).
    var index = CultureInfo.InvariantCulture.CompareInfo.IndexOf(
        texto, textoAComparar,
        CompareOptions.IgnoreCase | CompareOptions.IgnoreSymbols | CompareOptions.IgnoreNonSpace);
    return index != -1;
}

This will cover most cases. Here's a working example: https://dotnetfiddle.net/S1Jscu
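The dictionary of equivalent forms mentioned in this answer could look roughly like the sketch below. It is only an illustration: the class name `Equivalencias`, the method names, and the entries in the table are assumptions made for this example, not part of the original answer. Each known variant is rewritten to one canonical form before a plain substring check.

```csharp
using System;
using System.Globalization;
using System.Text;

public static class Equivalencias
{
    // Hypothetical table of variants. Longer variants come first so that
    // "s.p." is replaced before the shorter "s.p" can match inside it.
    private static readonly (string Variante, string Canonico)[] Tabela =
    {
        ("sao paulo", "sp"),
        ("s.p.", "sp"),
        ("s.p", "sp"),
    };

    public static string Canonizar(string texto)
    {
        // Decompose accented letters, drop the combining marks, lowercase.
        var sb = new StringBuilder();
        foreach (char c in texto.Normalize(NormalizationForm.FormD))
        {
            if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
                sb.Append(c);
        }
        string resultado = sb.ToString().ToLowerInvariant();

        // Rewrite every known variant to its canonical form.
        foreach (var (variante, canonico) in Tabela)
            resultado = resultado.Replace(variante, canonico);
        return resultado;
    }

    // True when the canonical form of 'procurado' occurs inside that of 'texto'.
    public static bool ContemEquivalente(string texto, string procurado)
    {
        return Canonizar(texto).Contains(Canonizar(procurado));
    }
}
```

With this table, `Equivalencias.ContemEquivalente("user registration - S.P.", "registration - SP")` would match. Note that the replacement is a naive substring replace, so a variant can also match inside a longer word; a real table would probably need word boundaries.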

Ricardo, that's what I figured: there is no way around an array. I thought there might be a way with Regex, but since the phrases can vary so much... the code would get big anyway. Thanks :)

– aa_sp

2018/04/04 at 16:34


by Marcos de Andrade • **497** points · Answer 1 · 2018-04-04T14:03:43+00:00

1

You can normalize the word by converting all of its characters to their unaccented, lowercase versions. You can use the System.Text namespace to perform the character conversion:

// Requires: using System.Text; (for NormalizationForm)
string s1 = "Cadastro de Usuários - SP";
string s2 = s1.Normalize(NormalizationForm.FormC).ToLower();
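One caveat: FormC only composes characters, so normalizing to it does not remove accents by itself. A minimal sketch of accent stripping in C# follows, assuming the usual approach of decomposing with FormD and then deleting the combining marks with a regular expression. The class name `Normalizador` and the method name `SemAcentos` are made up for this example.

```csharp
using System.Text;
using System.Text.RegularExpressions;

public static class Normalizador
{
    public static string SemAcentos(string texto)
    {
        // FormD splits "á" into "a" followed by a combining acute accent.
        string decomposto = texto.Normalize(NormalizationForm.FormD);

        // \p{M} matches any combining mark; removing them leaves the base letters.
        string semMarcas = Regex.Replace(decomposto, @"\p{M}", "");

        // Recompose whatever is left and lowercase it for comparison.
        return semMarcas.Normalize(NormalizationForm.FormC).ToLowerInvariant();
    }
}
```

Both the received phrase and the expected phrase would go through `SemAcentos` before being compared with `Contains`.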

3

But the question is related to C#

– Leandro Angelo

2018/04/04 at 14:10
1

I’m sorry, I hadn’t noticed the tag. I’ll edit.

– Marcos de Andrade

2018/04/04 at 14:12
Yes, C#... I looked and saw that C# also has Normalize, but it didn't work (nothing happened): string t = fraseRecebida.Normalize(NormalizationForm.FormD); t = t.replaceAll("\\p{M}", "").toLowerCase();

– aa_sp

2018/04/04 at 14:13
If I'm not mistaken, FormD only indicates whether the string is normalized, while FormC indicates it and replaces the characters where possible...

– Marcos de Andrade

2018/04/04 at 14:19
Marcos, I checked here and the code already does this; the question is really about the possible variations. But thanks for the help :)

– aa_sp

2018/04/04 at 16:32
I see. Good luck, Andreia!

– Marcos de Andrade

2018/04/04 at 17:36


by Leandro Angelo • **9,330** points · Answer 2 · 2018-04-04T14:37:30+00:00

You can add a helper class for these treatments, exposing them as extension methods on the string type and removing whatever characters you find pertinent. For example:

using System.Globalization;
using System.Text;
using System.Text.RegularExpressions;

public static class StringHelper
{
    // Decomposes accented letters (FormD) and drops the combining marks.
    public static string RemoverAcentos(this string texto)
    {
        StringBuilder retorno = new StringBuilder();
        var arrTexto =
            texto.Normalize(NormalizationForm.FormD).ToCharArray();

        foreach (var letra in arrTexto)
        {
            if (CharUnicodeInfo.GetUnicodeCategory(letra) !=
                UnicodeCategory.NonSpacingMark)
                retorno.Append(letra);
        }
        return retorno.ToString();
    }

    // Removes tabs and spaces.
    public static string RemoverEspacamentos(this string texto)
    {
        return texto.Replace("\t", "").Replace(" ", "");
    }

    // Removes accents, lowercases, and keeps only letters, digits and whitespace.
    // (Inside a character class "*" is a literal, so it is not listed here.)
    public static string RemoverCaracteresEspeciais(this string texto)
    {
        string retorno = texto.RemoverAcentos();
        retorno = Regex.Replace(retorno.ToLower(), @"[^a-z0-9\s]", "");
        return retorno;
    }
}

And use as follows:

string entrada = "São Paulo SP";
string entradaNormalizada = entrada.RemoverCaracteresEspeciais()
                            .RemoverEspacamentos()
                            .ToLower();

string cadastro = "Cidade de São Paulo - SP";
string cadastroNormalizado = cadastro.RemoverCaracteresEspeciais()
                            .RemoverEspacamentos()
                            .ToLower();

bool comparacao = cadastroNormalizado.Contains(entradaNormalizada); // true

Yet this is only the first part of your journey: even after these basic treatments, you will only get a match when the input is shorter than the registered text it is compared against and its terms appear in the same order. If the input is, for example, "I live in the city of São Paulo" or "SP - São Paulo", the comparison will be false.

From this point on, you must enrich your mechanism to work with a hit score, counting how many of the terms in A also appear in B, and base your decision on that score.
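The hit score idea can be sketched as follows. This is one possible reading, not the answer's own implementation: the class `Pontuacao`, its method names, and the choice of scoring by the fraction of A's distinct terms found in B are assumptions made for this example. Because terms are compared as a set, word order no longer matters.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;
using System.Text;

public static class Pontuacao
{
    // Lowercases, strips accents, and splits the text into distinct terms
    // on anything that is not a letter or a digit.
    public static HashSet<string> Termos(string texto)
    {
        var sb = new StringBuilder();
        foreach (char c in texto.Normalize(NormalizationForm.FormD))
        {
            if (CharUnicodeInfo.GetUnicodeCategory(c) == UnicodeCategory.NonSpacingMark)
                continue; // drop the accent mark
            sb.Append(char.IsLetterOrDigit(c) ? char.ToLowerInvariant(c) : ' ');
        }
        return new HashSet<string>(
            sb.ToString().Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries));
    }

    // Fraction of the distinct terms of A that also occur in B (0.0 to 1.0).
    public static double Calcular(string a, string b)
    {
        HashSet<string> termosA = Termos(a);
        if (termosA.Count == 0) return 0.0;
        HashSet<string> termosB = Termos(b);
        return (double)termosA.Count(t => termosB.Contains(t)) / termosA.Count;
    }
}
```

For the two inputs that fail the substring check, `Pontuacao.Calcular("SP - São Paulo", "Cidade de São Paulo - SP")` gives 1.0 (all three terms are found), while `Pontuacao.Calcular("I live in the city of São Paulo", "Cidade de São Paulo - SP")` gives 0.25 (two of eight terms). The threshold above which a score counts as a match is a decision left to the application.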

But if you need something more sophisticated, you will need to adopt a search API that meets your needs, such as Lucene or Reddog.Search.