Regular expression C# console application

6

I have a list of words, for example:

São Paulo
9000-000 a 9999-9999
Barigui
8000-0000 a 8999-999

I want to extract only the numbers, keeping the hyphen, and put them in a separate list, for example. I know the expression is /[0-9-]/g. It works on the regex. website, but it doesn't seem to work in my code:

Regex regex = new Regex(@"[0-9-]/g]");
Match match = regex.Match("9000-001 a 9999-999");
if(match.Success) 
{
   Console.WriteLine(match.Value);
}

Nothing happens, nothing prints.

2 answers

4

Also available on Ideone.

Try it like this:

using System.Text.RegularExpressions;
using System.Collections.Generic;
using System;
using System.Linq;

public class Program
{
    public static void Main(string[] args)
    {
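        // [\d-]+ matches one or more consecutive digits or hyphens
        Regex r = new Regex(@"[\d-]+");
        // Matches returns every occurrence, not just the first
        var matches = r.Matches("9000-001 a 9999-999");

        foreach (var match in matches.Cast<Match>())
        {
            Console.WriteLine($"Found the following match: '{match.Value}'");
        }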
    }
}

where, in the regular expression:

  • \d is the shorthand for a digit
  • + is the quantifier for one or more occurrences

Note that matches such as 999- or -12409- will also be returned by this regular expression. To avoid this and keep the format shown, use the regular expression \d+-\d+.
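
A minimal sketch of the difference between the two patterns. The input string, the `PatternComparison` class and the `FindAll` helper are made up for illustration and are not part of the original answer:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class PatternComparison
{
    // Returns every match of `pattern` in `input`, in order of appearance.
    public static List<string> FindAll(string pattern, string input)
    {
        return Regex.Matches(input, pattern)
                    .Cast<Match>()
                    .Select(m => m.Value)
                    .ToList();
    }

    public static void Main()
    {
        // Hypothetical input mixing one well-formed number with malformed pieces.
        string input = "9000-001 999- -12409- ---";

        // Prints: 9000-001 | 999- | -12409- | ---
        Console.WriteLine(string.Join(" | ", FindAll(@"[\d-]+", input)));

        // Prints: 9000-001
        Console.WriteLine(string.Join(" | ", FindAll(@"\d+-\d+", input)));
    }
}
```

The stricter pattern requires at least one digit on each side of the hyphen, which is why the malformed pieces drop out.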

  • 2

    There's another problem with [\d-]+: it also accepts strings like "------". \d+-\d+ really is the safer choice.

3


Although I don't program professionally in C# (I only have a vague notion of it), I will talk a little about the regex in question.


The website you mentioned (regex.) uses the syntax /expression/flags. In your case (/[0-9-]/g), the expression is [0-9-] and the modifier (flag) is g. The slashes are only delimiters that mark where the regular expression begins and ends. The g is the "global" modifier: it indicates that we want every occurrence in the string that satisfies the regex. With the g flag set, all the numbers are matched; without it, only the first occurrence is returned.

The point is that neither the / nor the g is part of the regex itself. Some languages, such as JavaScript, do use this syntax to create a regex directly:

// create a regex in JavaScript
let r = /[0-9-]/g;

In other languages, such as Python, the / and the g are not used:

# create a regex in Python
import re
r = re.compile('[0-9-]')

And to get all the occurrences you don't use a g flag, because there are specific methods for that, such as findall.

All this is to say that each language implements regex in its own way, and you will not necessarily need the / and the g when creating the expression.


From what I saw in the documentation of the Regex class, it doesn't need the /, because the constructor receives the expression directly. So when you wrote new Regex(@"[0-9-]/g]"), you created an expression that corresponds to:

  • [0-9-]: a digit from 0 to 9 or a hyphen
  • followed by the character /
  • followed by the character g
  • followed by the character ]

That's why it found no match: the tested string ("9000-001 a 9999-999") does have digits and hyphens, but none of them is followed by /g]. This regex would only work on something like "9999-999/g]". So the first step is to remove /g] from the expression.

Next, let's see what's left: [0-9-]. This expression corresponds to a single character, which can be a digit from 0 to 9 or a hyphen. So a string containing only hyphens will match, and so will a string containing only digits (no hyphen).
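
The two points above can be checked with a short sketch. The `DelimiterDemo` class and its constant names are hypothetical, chosen only for this illustration:

```csharp
using System;
using System.Text.RegularExpressions;

public static class DelimiterDemo
{
    // The pattern from the question: "/g]" is treated as literal text.
    public const string QuestionPattern = @"[0-9-]/g]";

    // What remains after removing the delimiter residue.
    public const string BarePattern = @"[0-9-]";

    public static void Main()
    {
        string input = "9000-001 a 9999-999";

        Console.WriteLine(Regex.IsMatch(input, QuestionPattern));         // False
        Console.WriteLine(Regex.IsMatch("9999-999/g]", QuestionPattern)); // True

        // Each match of the bare pattern is a single character.
        Console.WriteLine(Regex.Match(input, BarePattern).Value);   // 9
        Console.WriteLine(Regex.Matches(input, BarePattern).Count); // 16

        // Hyphens alone are enough to match.
        Console.WriteLine(Regex.IsMatch("------", BarePattern));    // True
    }
}
```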


Ideally, be more specific about what you want to get from the string. If you want "digits, followed by a hyphen, followed by more digits", write a regex that is as close as possible to that.

First, we can use the shorthand \d. In .NET it matches any Unicode decimal digit by default; it is equivalent to [0-9] when the RegexOptions.ECMAScript option is set, and the two behave the same for ASCII-only input like yours. We can also use quantifiers to specify how many digits will be considered. Examples:

  • \d{4}: exactly 4 digits
  • \d{2,4}: between 2 and 4 digits
  • \d{3,}: at least 3 digits (and no cap)
  • \d+: one or more digits (equivalent to \d{1,})
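
As a rough illustration of how these quantifiers behave, the sketch below prints the first match of each one against a made-up input. The `QuantifierDemo` class and the `FirstMatch` helper are hypothetical names:

```csharp
using System;
using System.Text.RegularExpressions;

public static class QuantifierDemo
{
    // Returns the first match of `pattern` in `input`, or an empty string if there is none.
    public static string FirstMatch(string pattern, string input)
    {
        return Regex.Match(input, pattern).Value;
    }

    public static void Main()
    {
        // Hypothetical input with digit runs of length 1, 2, 3 and 5.
        string input = "7 42 123 12345";

        Console.WriteLine(FirstMatch(@"\d{4}", input));   // 1234
        Console.WriteLine(FirstMatch(@"\d{2,4}", input)); // 42
        Console.WriteLine(FirstMatch(@"\d{3,}", input));  // 123
        Console.WriteLine(FirstMatch(@"\d+", input));     // 7
    }
}
```

Note that \d{4} finds 1234 inside the longer run 12345: a quantifier alone does not stop a match from starting or ending in the middle of a longer number.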

Choose the one that best fits your cases. Based on your examples, I will assume 4 digits before the hyphen and 3 or 4 digits after it (adjust this to your use cases). The expression then becomes:

\d{4}-\d{3,4}

With this, a string that has only hyphens, for example, no longer matches. The regex only considers digits, followed by a hyphen, followed by digits. But there's still a catch.

If the string is "1111-2222-3333", the regex will match the stretch 1111-2222. And if the string has more digits than you want, like "123456-12345", the regex will still match the stretch 3456-1234.

To make sure there are no extra digits before the first group or after the last one, we can use \b, which means "word boundary". It ensures that the expression is neither preceded nor followed by a digit or any other alphanumeric character:

\b\d{4}-\d{3,4}\b

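With this, the regex no longer matches cases like "123456-12345". Note that "1111-2222-3333" still yields 1111-2222, because the hyphen is not a word character and so counts as a boundary.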
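
A small sketch comparing the pattern with and without \b on the strings discussed above. The `BoundaryDemo` class and the `FindAll` helper are hypothetical names used only here:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class BoundaryDemo
{
    public const string Loose = @"\d{4}-\d{3,4}";
    public const string Bounded = @"\b\d{4}-\d{3,4}\b";

    // Returns every match of `pattern` in `input`, in order of appearance.
    public static List<string> FindAll(string pattern, string input)
    {
        return Regex.Matches(input, pattern)
                    .Cast<Match>()
                    .Select(m => m.Value)
                    .ToList();
    }

    public static void Main()
    {
        // Too many digits on both sides: only the loose pattern matches.
        Console.WriteLine(string.Join(",", FindAll(Loose, "123456-12345")));   // 3456-1234
        Console.WriteLine(FindAll(Bounded, "123456-12345").Count);             // 0

        // A hyphen counts as a boundary, so both patterns match here.
        Console.WriteLine(string.Join(",", FindAll(Loose, "1111-2222-3333")));   // 1111-2222
        Console.WriteLine(string.Join(",", FindAll(Bounded, "1111-2222-3333"))); // 1111-2222
    }
}
```

If strings such as "1111-2222-3333" also need to be rejected, the pattern would have to forbid a neighbouring hyphen as well, which \b alone does not do.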


Just one more detail: according to the documentation, the Match method returns only the first occurrence of the regex in the string. Since you want all occurrences, you can use Matches. Adapting the example from the documentation, the code would look like this:

Regex regex = new Regex(@"\b\d{4}-\d{3,4}\b");
// Matches returns every occurrence; Match would return only the first
foreach (Match match in regex.Matches("9000-001 a 9999-999")) {
   Console.WriteLine("I found '{0}' at position {1}", match.Value, match.Index);
}

The output is:

I found '9000-001' at position 0
I found '9999-999' at position 11

This example is also available on Ideone.
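
Since the question asks for the numbers in a separate list, here is one possible sketch that applies the same pattern to every line and collects the results. The `RangeExtractor` class and `ExtractNumbers` method are hypothetical names, and the sketch assumes the input is already available as a sequence of strings:

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class RangeExtractor
{
    // The pattern from the answer: 4 digits, a hyphen, then 3 or 4 digits.
    private static readonly Regex NumberPattern = new Regex(@"\b\d{4}-\d{3,4}\b");

    // Collects every match from every line into a single list.
    public static List<string> ExtractNumbers(IEnumerable<string> lines)
    {
        var numbers = new List<string>();
        foreach (string line in lines)
        {
            foreach (Match match in NumberPattern.Matches(line))
            {
                numbers.Add(match.Value);
            }
        }
        return numbers;
    }

    public static void Main()
    {
        // The sample lines from the question.
        var lines = new[] { "São Paulo", "9000-000 a 9999-9999", "Barigui", "8000-0000 a 8999-999" };

        foreach (string number in ExtractNumbers(lines))
        {
            Console.WriteLine(number);
        }
        // Prints 9000-000, 9999-9999, 8000-0000 and 8999-999, one per line.
    }
}
```

Lines without a matching number, such as the place names, simply contribute nothing to the list.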
