How to extract only numbers from a string?

Viewed 20,874 times

7

I want to extract only the digits of a CPF that is stored in a string in this format:

111.222.333-44

It should return only:

11122233344

  • If I’m not mistaken, \d includes letters as well.

  • Use only \d.

  • I don’t understand what you want. For the string "abc0d1e2", do you want "012" or "0"?

  • @Ericlemes, already clarified in the question.

4 answers

19


I got it. This code worked:

String.Join("", System.Text.RegularExpressions.Regex.Split(stringAqui, @"[^\d]"))
  • ^ inside a set ([]) means negation.
  • \d is shorthand for a digit: 0-9 (in .NET it also matches other Unicode decimal digits unless RegexOptions.ECMAScript is set).

In short, the regex matches everything that is not a digit.
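
To make the explanation above concrete, here is a minimal sketch of what the pattern does. The class name RegexDigits and its method names are made up for illustration; Regex.Split and Regex.Replace are the standard System.Text.RegularExpressions calls.

```csharp
using System.Text.RegularExpressions;

public static class RegexDigits {
    // Splits on every character that is not a digit and glues the pieces back together.
    // For "111.222.333-44" the pieces are "111", "222", "333" and "44".
    public static string WithSplit(string input) =>
        string.Join("", Regex.Split(input, @"[^\d]"));

    // Same result in one call: replace every non-digit character with nothing.
    public static string WithReplace(string input) =>
        Regex.Replace(input, @"[^\d]", "");
}
```

Both RegexDigits.WithSplit("111.222.333-44") and RegexDigits.WithReplace("111.222.333-44") return "11122233344".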

  • Another option: the Regex.Replace(String, String) method

  • @Marconi sure, you can

7

For cases like this there is no need to complicate things; just use Substring():

cpf.Substring(0, 3) + cpf.Substring(4, 3) + cpf.Substring(8, 3) + cpf.Substring(12, 2)

Even better:

string.Concat(cpf.AsSpan(0, 3), cpf.AsSpan(4, 3), cpf.AsSpan(8, 3), cpf.AsSpan(12, 2))

You can also use a nicer syntax, though the performance will be about the same as Substring():

cpf[0..3] + cpf[4..7] + cpf[8..11] + cpf[12..14]

Although the question can be read as solving this only for a CPF, the same thing can be done to strip any data of unknown format down to its numeric digits. In that case I would do this:

var sb = new StringBuilder(cpf.Length);
foreach (var letra in cpf) if (Char.IsDigit(letra)) sb.Append(letra);
var formatado = sb.ToString();

This has the same effect as the LINQ solution posted by Samuel, which is a good answer: it is shorter, and a good choice if performance is not a concern (you have to weigh which one is worth using).
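
One caveat, offered as a hedged sketch: Char.IsDigit accepts any Unicode decimal digit, not only 0-9, so for data of unknown origin an explicit range check may be safer. The class and method names below (DigitFilter, OnlyAsciiDigits) are hypothetical and not part of the benchmark.

```csharp
using System.Text;

public static class DigitFilter {
    // Keeps only the ASCII characters '0'..'9'.
    // Unlike Char.IsDigit, this rejects decimal digits from other scripts
    // (for example the Arabic-Indic digit U+0663).
    public static string OnlyAsciiDigits(string input) {
        var sb = new StringBuilder(input.Length);
        foreach (var c in input) {
            if (c >= '0' && c <= '9') sb.Append(c);
        }
        return sb.ToString();
    }
}
```

DigitFilter.OnlyAsciiDigits("111.222.333-44") returns "11122233344", the same as the loop above for a well-formed CPF.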

I wrote some code comparing the options posted here:

using System;
using static System.Console;
using System.Diagnostics;
using System.Text.RegularExpressions;
using System.Linq;
using System.Text;

public class Program {
    public static void Main() {
        const int total = 1_000_000;
        var cpf = "111.222.333-44";
        var formatado = "";
        var sw = Stopwatch.StartNew();
        for (var i = 0; i < total; i++) {
            formatado = cpf.Substring(0, 3) + cpf.Substring(4, 3) + cpf.Substring(8, 3) + cpf.Substring(12, 2);
        }
        sw.Stop();
        WriteLine(sw.ElapsedMilliseconds);
        sw.Restart();
        for (var i = 0; i < total; i++) {
            formatado = string.Concat(cpf.AsSpan(0, 3), cpf.AsSpan(4, 3), cpf.AsSpan(8, 3), cpf.AsSpan(12, 2));
        }
        sw.Stop();
        WriteLine(sw.ElapsedMilliseconds);
        sw.Restart();
        for (var i = 0; i < total; i++) {
            formatado = cpf[0..3] + cpf[4..7] + cpf[8..11] + cpf[12..14];
        }
        sw.Stop();
        WriteLine(sw.ElapsedMilliseconds);
        sw.Restart();
        for (var i = 0; i < total; i++) {
            var sb = new StringBuilder(cpf.Length);
            foreach (var letra in cpf) if (Char.IsDigit(letra)) sb.Append(letra);
            formatado = sb.ToString();
        }
        sw.Stop();
        WriteLine(sw.ElapsedMilliseconds);
        sw.Restart();
        for (var i = 0; i < total; i++) {
            formatado = string.Join("", cpf.ToCharArray().Where(Char.IsDigit));
        }
        sw.Stop();
        WriteLine(sw.ElapsedMilliseconds);
        sw.Restart();
        for (var i = 0; i < total; i++) {
            formatado = String.Join("", Regex.Split(cpf, @"[^\d]"));
        }
        sw.Stop();
        WriteLine(sw.ElapsedMilliseconds);
        sw.Restart();
        for (var i = 0; i < total; i++) {
            Regex r = new Regex(@"\d+");            
            var result = "";
            foreach (Match m in r.Matches(cpf)) result += m.Value;
            formatado = result;
        }
        sw.Stop();
        WriteLine(sw.ElapsedMilliseconds);
    }
}

See it running on .NET Fiddle. I also put it on GitHub for future reference.

I ran the test on my own machine, because these online IDEs are unreliable for timing and the best approach is to use BenchmarkDotNet. The results:

BenchmarkDotNet=v0.12.1, OS=Windows 10.0.19041.329 (2004/?/20H1)
Intel Core i7-7700K CPU 4.20GHz (Kaby Lake), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=5.0.100-preview.1.20155.7
  [Host]     : .NET Core 5.0.0 (CoreCLR 5.0.20.12005, CoreFX 5.0.20.12005), X64 RyuJIT
  DefaultJob : .NET Core 5.0.0 (CoreCLR 5.0.20.12005, CoreFX 5.0.20.12005), X64 RyuJIT


|        Method |        Mean |    Error |   StdDev |  Gen 0 | Gen 1 | Gen 2 | Allocated |
|-------------- |------------:|---------:|---------:|-------:|------:|------:|----------:|
|     Substring |    55.52 ns | 0.248 ns | 0.232 ns | 0.0421 |     - |     - |     176 B |
|          Span |    24.67 ns | 0.097 ns | 0.086 ns | 0.0115 |     - |     - |      48 B |
|         Range |    53.06 ns | 0.260 ns | 0.231 ns | 0.0421 |     - |     - |     176 B |
|       Foreach |    44.22 ns | 0.255 ns | 0.238 ns | 0.0363 |     - |     - |     152 B |
|          Linq |   191.17 ns | 0.752 ns | 0.703 ns | 0.1147 |     - |     - |     480 B |
|    RegexSplit |   396.71 ns | 2.431 ns | 2.274 ns | 0.0763 |     - |     - |     320 B |
|  RegexReplace |   366.67 ns | 0.901 ns | 0.704 ns | 0.0110 |     - |     - |      48 B |
|  RegexMatches | 1,780.55 ns | 6.015 ns | 5.332 ns | 0.8678 |     - |     - |    3632 B |
| RegexMatchesG |   818.62 ns | 3.186 ns | 2.980 ns | 0.5465 |     - |     - |    2288 B |

I could have tested different environments, but I didn’t think it was worth it.

For anyone who wants the code:

using System;
using System.Text.RegularExpressions;
using System.Linq;
using System.Text;
using BenchmarkDotNet.Running;
using BenchmarkDotNet.Attributes;

[MemoryDiagnoser]
public class Program {
    public string cpf = "111.222.333-44";
    public static void Main(string[] args) => BenchmarkSwitcher.FromAssembly(typeof(Program).Assembly).Run(args);
    [Benchmark]
    public string Substring() => cpf.Substring(0, 3) + cpf.Substring(4, 3) + cpf.Substring(8, 3) + cpf.Substring(12, 2);
    [Benchmark]
    public string Span() => string.Concat(cpf.AsSpan(0, 3), cpf.AsSpan(4, 3), cpf.AsSpan(8, 3), cpf.AsSpan(12, 2));
    [Benchmark]
    public string Range() => cpf[0..3] + cpf[4..7] + cpf[8..11] + cpf[12..14];
    [Benchmark]
    public string Foreach() {
        var sb = new StringBuilder(cpf.Length);
        foreach (var letra in cpf) if (Char.IsDigit(letra)) sb.Append(letra);
        return sb.ToString();
    }
    [Benchmark]
    public string Linq() => string.Join("", cpf.ToCharArray().Where(Char.IsDigit));
    [Benchmark]
    public string RegexSplit() => String.Join("", Regex.Split(cpf, @"[^\d]"));
    [Benchmark]
    public string RegexReplace() => Regex.Replace(cpf, @"[^\d]", "");
    [Benchmark]
    public string RegexMatches() {
        Regex r = new Regex(@"\d+");
        var result = "";
        foreach (Match m in r.Matches(cpf)) result += m.Value;
        return result;
    }
    [Benchmark]
    public string RegexMatchesG() {
        // Caution: \G is a zero-width anchor, so every match is empty and this method returns "".
        Regex r = new Regex(@"\G");
        var result = "";
        foreach (Match m in r.Matches(cpf)) result += m.Value;
        return result;
    }
}

I put it on GitHub for future reference.

Filtering out the non-digits (Foreach) was faster and more generic than the Substring and Range solutions that pick out the individual pieces. The piece-by-piece composition can still be made faster with Span, but filtering with a simple loop is very competitive.

This gave me the idea of trying to build a StringBuilder on the stack to be even faster, but I know it would have limitations: it could not really be a builder, it would have to be a buffer with a relatively small maximum size, so it could not be used in every situation.
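
A rough sketch of that stack-buffer idea, under stated assumptions: the names StackDigits, OnlyDigits and MaxStackChars are hypothetical, the 256-character cap is an arbitrary choice, and it needs a runtime where Span<char> and the string(ReadOnlySpan<char>) constructor are available. I have not benchmarked it, so whether it beats the StringBuilder loop is an open question.

```csharp
using System;

public static class StackDigits {
    // Hypothetical cap: inputs longer than this fall back to a heap array.
    private const int MaxStackChars = 256;

    public static string OnlyDigits(string input) {
        // For small inputs the buffer lives on the stack, so the only heap
        // allocation is the final string.
        Span<char> buffer = input.Length <= MaxStackChars
            ? stackalloc char[input.Length]
            : new char[input.Length];
        var count = 0;
        foreach (var c in input) {
            if (char.IsDigit(c)) buffer[count++] = c;
        }
        // Copy only the filled part of the buffer into a new string.
        return new string(buffer.Slice(0, count));
    }
}
```

The buffer is sized to the input length because the output can never be longer than the input; the cap is what keeps large inputs from overflowing the stack.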

Not that it’s wrong, but I wouldn’t use Regex under any circumstances. It is always a less readable solution, with absurdly bad performance, and it does not always work the way you expect. I’m a fan of Jamie Zawinski’s quote:

When you try to solve a problem with regex, you end up with two problems.

I won’t go into detail about it here because it’s not the focus of the question.

  • While researching more about performance with AsSpan, I found this article, so I decided to test your AsSpan implementation against the one in the article. The article’s implementation came out better; I believe that is because string.Concat is managed by the GC. +1

  • @Samuelrenangonçalvesvaz I’m not sure I understand; AsSpan() was the fastest of all. I used Concat() here because this problem requires it. There are ways to avoid it, but they would not produce the same result, and I wanted every option to give the same result so nobody could say one was faster while returning something different, even if similar. Anyway, show the test you did; you may have done something I haven’t seen. Thanks. Someone didn’t like the answer and downvoted it, too.

  • @Maniero, I redid the test, which I had run in "debug" mode, and realized I was only taking the 0..3 slice, so my implementation was wrong. Congratulations on the answer.

  • @Samuelrenangonçalvesvaz It has to be run in Release.

  • 2

    @Samuelrenangonçalvesvaz I improved the answer to cover cleaning instead of only composition. Maybe the downvote was because the person thought only the cleaning approach was valid.

  • It turned out great. I fully agree with the part about regex; pity there is no way to give +2 to the answer.

3

A solution without using Regex:

var cpf = "111.222.333-44";

string.Join("", cpf.ToCharArray().Where(Char.IsDigit));

Reference: SOen

1

Another solution:

    [TestMethod]
    public void TestGetOnlyNumbers()
    {
        Regex r = new Regex(@"\d+");            
        string result = "";
        foreach (Match m in r.Matches("111.222.333-44"))
            result += m.Value;

        Assert.AreEqual("11122233344", result);
    }

The trick is that you need to collect multiple matches of what counts as a number. If you use the Regex.Match method, you only get the first one (111).
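
The difference between the two calls can be sketched as follows. The class and method names (MatchDemo, FirstRun, AllRuns) are illustrative only; Regex.Match and Regex.Matches are the real API.

```csharp
using System.Text;
using System.Text.RegularExpressions;

public static class MatchDemo {
    // Regex.Match stops at the first run of digits.
    public static string FirstRun(string input) =>
        Regex.Match(input, @"\d+").Value;

    // Regex.Matches returns every run; concatenating them gives all the digits.
    public static string AllRuns(string input) {
        var sb = new StringBuilder();
        foreach (Match m in Regex.Matches(input, @"\d+")) sb.Append(m.Value);
        return sb.ToString();
    }
}
```

For "111.222.333-44", MatchDemo.FirstRun returns "111" while MatchDemo.AllRuns returns "11122233344".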

  • You can use the g flag, which keeps searching.
