Extract only the digits from a string using regex in Python

2

I’m trying to clean a column of a dataset of processes, and the process numbers come in various formats:

5080847-62.2018.4.04.7100
033/2.17.0001000-7

I want to keep only the digits and remove all other characters. I already tried r"[\d+]", but it only returns the first digits before any other character, and the rest is discarded.

2 answers

3

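import re

proc = " 5080847-62.2018.4.04.7100 033/2.17.0001000-7"
# find every run of digits, then glue the runs into one string
# (raw string, so that \d is not treated as a string escape)
numeros = "".join(re.findall(r"\d+", proc))
print(numeros)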

A regular expression matches "blocks" of the text: \d+ searches for a "block of digits with at least one digit". Here the regular expression finds the digit blocks, and the findall method returns all the matches in the string, in this case all the digit sequences. The call to "".join glues these sequences back into a single string.
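To make the two steps visible, here is a minimal sketch that prints the intermediate list returned by re.findall before it is joined (the name blocos is only illustrative):

```python
import re

proc = " 5080847-62.2018.4.04.7100 033/2.17.0001000-7"

# findall returns every non-overlapping run of digits, in order of appearance
blocos = re.findall(r"\d+", proc)
print(blocos)

# join concatenates the runs with no separator between them
numeros = "".join(blocos)
print(numeros)
```

Note that the boundary between the two process numbers is lost after the join, because the space between them is not a digit either.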

It is also possible to filter the characters with a generator expression. In this case regular expressions are not even necessary (which ends up being one less problem):

proc = " 5080847-62.2018.4.04.7100 033/2.17.0001000-7"
numeros = "".join(char for char in proc if char.isdigit())
print(numeros)

3

You can use the shortcut \D (anything that is not a digit) and remove those characters from the string:

import re

r = re.compile(r'\D')

print(r.sub('', '5080847-62.2018.4.04.7100')) # 50808476220184047100
print(r.sub('', '033/2.17.0001000-7')) # 03321700010007

The sub method replaces the parts that match the regex with '' (empty string). In practice, this is the same as removing everything that matches \D. The output is:

50808476220184047100
03321700010007

If the numbers are in the same string, you can use a negated character class to prevent the spaces from being removed:

r = re.compile(r'[^\d ]')

print(r.sub('', '5080847-62.2018.4.04.7100 033/2.17.0001000-7'))

Now the regex is [^\d ] (anything that is not \d or a space; the [^ indicates that I want anything that is not inside the brackets). Note that there is a space before the ], because it is part of what I don’t want replaced.

Thus, digits and spaces are preserved, and everything else is removed. The result is:

50808476220184047100 03321700010007
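Since the spaces survive, the cleaned string can be split back into one value per number. A minimal sketch (the names limpo and numeros are illustrative, not part of the code above):

```python
import re

r = re.compile(r'[^\d ]')

limpo = r.sub('', '5080847-62.2018.4.04.7100 033/2.17.0001000-7')

# split() with no argument splits on runs of whitespace
numeros = limpo.split()
print(numeros)  # ['50808476220184047100', '03321700010007']
```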

Note: the shortcut \d corresponds to any character of the Unicode category "Number, Decimal Digit". This includes not only digits from 0 to 9, but also several other characters representing digits, such as ٢ (ARABIC-INDIC DIGIT TWO), among others.

So, by default, \d matches these characters and \D does not. If your data does not contain such characters, this is not a problem. But if you want to be more specific and consider only the digits from 0 to 9, you can use the ASCII flag, or use [0-9] instead of \d and [^0-9] instead of \D:

import re

r = re.compile(r'\D')
print(r.sub('', '12-34.٢٤٨')) # 1234٢٤٨

r = re.compile(r'\D', re.ASCII)
print(r.sub('', '12-34.٢٤٨')) # 1234

r = re.compile(r'[^0-9]')
print(r.sub('', '12-34.٢٤٨')) # 1234
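The same caveat applies if you filter with str.isdigit() in a generator expression instead of a regex, since that method is also Unicode-aware. A minimal sketch, assuming only the digits 0 to 9 should be kept (the variable names are illustrative):

```python
import string

texto = '12-34.٢٤٨'

# str.isdigit() is also True for non-ASCII digit characters
unicode_digits = "".join(ch for ch in texto if ch.isdigit())
print(unicode_digits)

# membership in string.digits ("0123456789") keeps only ASCII digits
ascii_digits = "".join(ch for ch in texto if ch in string.digits)
print(ascii_digits)  # 1234
```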
  • Hello, first of all, thank you for the tip.

  • I need to apply this rule to an entire column, each row containing a different number. Any suggestions? Thanks.

  • Would it be this: https://www.geeksforgeeks.org/replace-values-in-pandas-dataframe-using-regex/ ?
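For the column question above, a minimal sketch in plain Python, assuming the column has already been read into a list of strings. The names column, only_digits and NON_DIGIT are hypothetical, chosen only for this illustration. With pandas, the usual equivalent would be something along the lines of df["col"].str.replace(r"\D", "", regex=True), where df and "col" are placeholder names.

```python
import re

# compile once and reuse for every row
NON_DIGIT = re.compile(r"\D")

def only_digits(value: str) -> str:
    """Remove every character that is not a digit."""
    return NON_DIGIT.sub("", value)

# hypothetical column values, one process number per row
column = ["5080847-62.2018.4.04.7100", "033/2.17.0001000-7"]

cleaned = [only_digits(value) for value in column]
print(cleaned)  # ['50808476220184047100', '03321700010007']
```

If the data may contain non-ASCII digits, passing re.ASCII to re.compile restricts the match to 0 to 9, as described in the answer above.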
