Regular expression keyword marking

Good morning! I have spent the last two weeks learning to process text in SQL Server (through CLR) using regular expressions. My challenge now is to "mark" the names of municipalities so that later steps leave them untouched. It would work like a list of stopwords, but in reverse. My idea is to put a "_" at the beginning and end of each municipality name, using this code:

using System;
using System.Text.RegularExpressions;

public class Exemplo
{
    public static void Main()
    {
        string pattern = @"(\bRio de Janeiro\b|\bManaus\b)";
        string[] titles = { "As cidades de Manaus e Rio de Janeiro" };
        string replacement = "_$1_";
        foreach (string title in titles)
            Console.WriteLine(Regex.Replace(title, pattern, replacement));
    }
}

This would result in: "As cidades de _Manaus_ e _Rio de Janeiro_" (in English, "The cities of _Manaus_ and _Rio de Janeiro_").

How could I, in a performant way, get this result instead: "The cities of _Manaus_ and _Rio_de_Janeiro_"?

Or is there another way to keep a multi-word name from being processed as separate words?

With the internal spaces replaced by _, a later step (I'm using a stemmer) followed by removal of the underscores would give "rio de janeiro" and "Manaus", and not "rio dar Jane" and "man", which is what my code currently produces.

Thanks.

  • You also want to replace the spaces inside the "keywords" with _? In markup languages, internal spaces are normally kept as they are. Changing them to underscores changes the content, not just the presentation. So for all the markup engines I know, the correct form would be _Rio de Janeiro_, not _Rio_de_Janeiro_.

  • Hello! Yes, with the internal spaces replaced by _. In a later step I would remove the preposition "de" ("of") and then remove the underscores. Without protecting the internal spaces, the result would be "The cities Manaus and Rio Janeiro", when what I want is "The cities Manaus and Rio de Janeiro" (this is a simplified treatment, for now).

  • Not directly related, but the regex could be "simplified" to @"\b(Rio de Janeiro|Manaus)\b". Not that this will make it much faster, since regex is not the fastest thing in the world... Anyway, if you leave the _ only at the beginning and end, I think it is much easier to handle (you just look for everything between the _ characters), whereas several _ in the middle of the string make it much harder to detect where the markup begins and ends. In any case, one more replacement would be enough. As for performance, only measuring will tell whether it is good enough for you.

  • The thing is that I'm applying a stemmer. After processing, "rio de janeiro" becomes "rio dar Jane" and "Manaus" becomes "man", and the context is lost.

  • Reginaldo, I believe all this information should be in the question, so I suggest you click [Edit] and add the whole context and an explanation of exactly what you are trying to do, what the input and the expected result are, etc. This regex seems to be just part of an attempt to solve a bigger problem. In fact, the question as it stands gives the impression of being a typical XY problem.

  • I’ll do that, thank you. This is the first time I’ve used this feature.
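
For reference, the idea discussed in the comments (the simplified pattern plus one more replacement for the internal spaces) could be sketched roughly as below. This is only an illustration, not a tested answer: the class name KeywordMarker, the methods Mark and Unmark, and the hard-coded Names array are all hypothetical, and a real list of municipalities would presumably come from a table. The sketch assumes the names contain only letters and spaces, and that no other token in the text starts and ends with an underscore.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class KeywordMarker
{
    // Hypothetical keyword list, hard-coded for illustration only.
    private static readonly string[] Names = { "Rio de Janeiro", "Manaus" };

    // One alternation built once: names are escaped and sorted longest first,
    // so a longer name is tried before a shorter name that is its prefix.
    private static readonly Regex Pattern = new Regex(
        @"\b(?:" + string.Join("|",
            Names.OrderByDescending(n => n.Length).Select(Regex.Escape)) + @")\b");

    // Matches a whole token that starts and ends with an underscore.
    private static readonly Regex Marked = new Regex(@"\b_(\w+)_\b");

    // Wraps each keyword in underscores and replaces its internal spaces,
    // all in a single pass over the text via a match evaluator.
    public static string Mark(string text)
    {
        return Pattern.Replace(text, m => "_" + m.Value.Replace(' ', '_') + "_");
    }

    // Reverses Mark: drops the outer underscores and restores the spaces.
    // Assumption: any other underscore-wrapped token would be unmarked too.
    public static string Unmark(string text)
    {
        return Marked.Replace(text, m => m.Groups[1].Value.Replace('_', ' '));
    }

    public static void Main()
    {
        string marked = Mark("As cidades de Manaus e Rio de Janeiro");
        Console.WriteLine(marked);          // As cidades de _Manaus_ e _Rio_de_Janeiro_
        Console.WriteLine(Unmark(marked));  // As cidades de Manaus e Rio de Janeiro
    }
}
```

Because each marked name becomes a single whitespace-free token, a later step can skip any token that starts and ends with _ before stemming and call Unmark at the end. Whether this is fast enough inside SQL Server CLR would still have to be measured.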
