How to do a spell check in C#?


19

I need to analyze the words stored in a database. The analysis consists only of running a spell check and showing a report on screen (a GridView) listing the misspelled words.

I have never developed anything like this, so I would appreciate some pointers.

I can start from this example:

string[] palavrasParaCorrigir = {"batata", "conoira", "cebola", "pimentao", "beterraba"};
  • See if this suits you: Aspell.Net (note: the Portuguese version is licensed under the GPL, which can be problematic to integrate into your system, depending on the case)

  • You can work with the Word DLL to perform such checks. Check this link out: http://www.codeproject.com/Articles/2469/SpellCheck-net-spell-checking-parsing-using-C (a rough sketch of this route follows these comments)

  • Just remember that these answers are mostly relevant for WinForms. WPF has spell checking by default on the components where it makes sense (also sketched after these comments).

  • Your answer sounds interesting, but right now it depends entirely on the link. If the link goes offline, the answer becomes useless to the reader. If you improve it to include a more detailed explanation of how to use the Word DLL (there is no problem in copying/translating the CodeProject material, since you cite the original source) and even some code snippets, you get my +1. :)
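For reference, a rough sketch of the Word-interop route mentioned in the second comment, assuming Microsoft Word is installed and a reference to Microsoft.Office.Interop.Word has been added to the project (class and method names here are only illustrative):

using System;
using Word = Microsoft.Office.Interop.Word;

// Wraps a hidden Word instance and uses its spell checker.
class VerificadorWord : IDisposable
{
    private readonly Word.Application app = new Word.Application();

    // CheckSpelling returns true when Word considers the word correctly spelled.
    public bool Verificar(string palavra)
    {
        return app.CheckSpelling(palavra);
    }

    public void Dispose()
    {
        app.Quit(); // always close the hidden Word instance
    }
}

Keep in mind this spins up a full Word process behind the scenes, so it is better suited to batch checks than to per-keystroke validation.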
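And, illustrating the WPF comment above, a minimal sketch of turning on the built-in spell checker on a TextBox (names are illustrative); note that, out of the box, WPF only ships dictionaries for a handful of languages (English, French, German and Spanish), so Portuguese would need a custom dictionary:

using System.Windows;
using System.Windows.Controls;
using System.Windows.Markup;

// A bare WPF window whose TextBox uses the built-in spell checker.
public class JanelaComCorretor : Window
{
    public JanelaComCorretor()
    {
        var caixaDeTexto = new TextBox();
        caixaDeTexto.SpellCheck.IsEnabled = true;                  // enables WPF's built-in spell checking
        caixaDeTexto.Language = XmlLanguage.GetLanguage("en-US");  // the checker follows the element's language
        Content = caixaDeTexto;
    }
}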

2 answers

14

Existing libraries such as Hunspell (already mentioned in the accepted answer) or Aspell will probably solve your problem quickly: these libraries exist for several languages and are used in many programs.

But if you want to dig a little deeper: there is an excellent article by Peter Norvig (Google’s Director of Research) on the subject: http://norvig.com/spell-correct.html

Sure, it is in English, but it explains in a basic way how Google's spell corrector works when we use the search engine and it suggests a correction.

In summary: the system is based on a dictionary and a Hamming-distance check with distance 2. In the article and its examples, the dictionary is a file containing a large amount of text in which the words are correctly written; Peter Norvig used several Shakespeare texts for this.

When the user enters a word, the program takes this word, and sees if it exists in the dictionary. If so, the word is correct.

If it does not exist, it generates several mutants (variations with errors) of this word, using the following techniques (see the sketch after this list):

  • Swap two adjacent letters;
  • Replace the letter at each position with another one;
  • Insert a letter at each position;
  • Delete the letter at each position.
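A minimal C# sketch of this mutant-generation step, loosely following the article (the names and class layout are illustrative, not the article's code):

using System.Collections.Generic;
using System.Linq;

static class Mutantes
{
    const string Letras = "abcdefghijklmnopqrstuvwxyz";

    // Generates every distance-1 "mutant" of the word:
    // transpositions, replacements, insertions and deletions.
    public static IEnumerable<string> GerarMutantes(string palavra)
    {
        var cortes = Enumerable.Range(0, palavra.Length + 1)
            .Select(i => (Esq: palavra.Substring(0, i), Dir: palavra.Substring(i)));

        foreach (var (esq, dir) in cortes)
        {
            if (dir.Length > 1)                      // swap two adjacent letters
                yield return esq + dir[1] + dir[0] + dir.Substring(2);
            if (dir.Length > 0)                      // delete the letter at this position
                yield return esq + dir.Substring(1);
            foreach (char c in Letras)
            {
                if (dir.Length > 0)                  // replace the letter at this position
                    yield return esq + c + dir.Substring(1);
                yield return esq + c + dir;          // insert a letter at this position
            }
        }
    }
}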

From this list of mutants, it checks whether any of them exist in the dictionary. The candidate that appears most often in the dictionary text is taken as the correction.

In the sample program, if it still cannot find a correct word, it takes every word in the mutant list and generates new mutants from those (reaching distance 2), and again checks whether any of them exist in the dictionary.
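A rough sketch of that selection step, reusing the GerarMutantes helper from the sketch above; contagens stands for a hypothetical word-frequency map built from the dictionary text:

using System.Collections.Generic;
using System.Linq;

static class Corretor
{
    // Returns the word itself if it is in the dictionary; otherwise the most
    // frequent dictionary word among the distance-1 mutants, then distance-2.
    public static string Corrigir(string palavra, Dictionary<string, int> contagens)
    {
        if (contagens.ContainsKey(palavra))
            return palavra;                            // already correct

        var candidatos = Mutantes.GerarMutantes(palavra)
                                 .Where(contagens.ContainsKey)
                                 .ToList();

        if (candidatos.Count == 0)                     // nothing at distance 1: try distance 2
            candidatos = Mutantes.GerarMutantes(palavra)
                                 .SelectMany(Mutantes.GerarMutantes)
                                 .Where(contagens.ContainsKey)
                                 .ToList();

        return candidatos.Count > 0
            ? candidatos.OrderByDescending(c => contagens[c]).First()
            : palavra;                                 // no candidate found: return as-is
    }
}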

At the end of the article there is code for the program in several languages (at the time, I wrote a version in Java and one in Groovy), and you will find versions for virtually every language, including two in C#.

The only additional detail is that you may have to tweak the source code so that the letter range is not just a-z, but also includes accented letters, as we use in Portuguese.

And of course, you will need a dictionary in Portuguese. Or, optionally, if your list consists only of products, for example, you can use your product list instead of a dictionary.

  • Very good answer, and the article is great. But it takes a fair amount of work to achieve the expected result.

  • There is a certain fondness for whatever "comes from the folks at Google", often with good reason... But "the folks for Brazilian Portuguese" are us (!), and there is some consensus that two layers solve it: in the first, Metaphone is used for high recall; in the second, to raise precision, the cited "fuzzy" algorithms are used, although a good "similar text" measure (based on Hamming distance) is enough.

  • @Peterkrauss But the article and the algorithms presented by Peter Norvig are based exactly on Hamming distance (in this case, it checks errors up to distance 2).

  • You have confirmed my suspicions :-) A sign that your explanation (I only skimmed it) is not only didactic but also correct. However, do not confuse things... I was drawing attention to the two layers: the first reduces the universe (on databases with thousands of "phrases") and requires knowing a priori whether it is a "pt-BR universe". Hamming (time-consuming and impossible to use as an index) is applied only in the second layer, that is, to a small set, which is therefore tractable.

11


If you don't mind the fact that the spell checker is under the GPL license, a good solution would be to use NHunspell.

You can get one of its latest versions here. After adding NHunspell.dll to your project, just use the following code to do the check:

using System.Collections.Generic;
using NHunspell;

using (Hunspell hunspell = new Hunspell("pt_br.aff", "pt_br.dic"))
{
    bool ortografia = hunspell.Spell("palavra a ser verificada");

    if (ortografia == false) // The word is not spelled correctly.
    {
        /*...*/
    }

    List<string> sugestoes = hunspell.Suggest("palavra a ser verificada"); // Builds the list of suggestions (possible words).
}

Note: the affix and dictionary files (.aff and .dic) can be found here.
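Tying this back to the question, a hypothetical sketch of checking the example word list and binding the misspelled words to a GridView, assuming an ASP.NET WebForms page with a grid called gridViewPalavras (adapt the binding to whatever grid you actually use):

using System.Linq;
using NHunspell;

protected void VerificarOrtografia()
{
    string[] palavrasParaCorrigir = { "batata", "conoira", "cebola", "pimentao", "beterraba" };

    using (var hunspell = new Hunspell("pt_br.aff", "pt_br.dic"))
    {
        var incorretas = palavrasParaCorrigir
            .Where(p => !hunspell.Spell(p))                    // keep only the misspelled words
            .Select(p => new
            {
                Palavra = p,
                Sugestoes = string.Join(", ", hunspell.Suggest(p))
            })
            .ToList();

        gridViewPalavras.DataSource = incorretas;              // hypothetical GridView on the page
        gridViewPalavras.DataBind();
    }
}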

  • I accepted this as the correct answer because it is the simplest alternative to implement.
