You can use the shorthand \D
(anything that is not a digit) and remove every match from the string:
import re
r = re.compile(r'\D')
print(r.sub('', '5080847-62.2018.4.04.7100')) # 50808476220184047100
print(r.sub('', '033/2.17.0001000-7')) # 03321700010007
The sub
method replaces every part of the string that matches the regex with ''
(the empty string). In practice, this is the same as removing everything that matches \D
. The output is:
50808476220184047100
03321700010007
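For a one-off substitution, the same result can be obtained without compiling the pattern first, since the re module also exposes sub as a module-level function. A minimal sketch (the variable names are only illustrative):

```python
import re

# re.sub(pattern, replacement, string) compiles the pattern internally
case_number = re.sub(r'\D', '', '5080847-62.2018.4.04.7100')
print(case_number)  # 50808476220184047100
```

Compiling with re.compile, as above, is mostly worthwhile when the same regex is reused many times.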
If the numbers are in the same string, you can use a negated character class to prevent the spaces from being removed:
r = re.compile(r'[^\d ]')
print(r.sub('', '5080847-62.2018.4.04.7100 033/2.17.0001000-7'))
Now the regex is [^\d ]
(anything that is not \d
or a space; the [^
indicates that I want anything that is not inside the brackets). Note that there is a space before the ]
, because it is part of what I don't want replaced.
Thus, digits and spaces are preserved, and everything else is removed. The result is:
50808476220184047100 03321700010007
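If you would rather have each number as a separate value instead of one string, one possible approach (a sketch, assuming the numbers are separated by whitespace) is to split first and then clean each piece with \D:

```python
import re

non_digit = re.compile(r'\D')
text = '5080847-62.2018.4.04.7100 033/2.17.0001000-7'

# split() breaks the string on whitespace; each piece is then cleaned
numbers = [non_digit.sub('', piece) for piece in text.split()]
print(numbers)  # ['50808476220184047100', '03321700010007']
```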
Note: the shorthand \d
corresponds to any character of the Unicode category "Number, Decimal Digit". This includes not only digits from 0 to 9, but also several other characters representing digits, such as ٢
(ARABIC-INDIC DIGIT TWO), among others.
So, by default, \d
matches these characters, and \D
does not. If your data does not contain such characters, there is no problem. But if you want to be more specific and consider only the digits from 0 to 9, you can use the re.ASCII flag, or else use [0-9]
instead of \d
, and [^0-9]
instead of \D
:
import re
r = re.compile(r'\D')
print(r.sub('', '12-34.٢٤٨')) # 1234٢٤٨
r = re.compile(r'\D', re.ASCII)
print(r.sub('', '12-34.٢٤٨')) # 1234
r = re.compile(r'[^0-9]')
print(r.sub('', '12-34.٢٤٨')) # 1234
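To check why a given character is treated as a digit, you can inspect its Unicode category with the standard unicodedata module. A small sketch, writing the character as the escape \u0662 (ARABIC-INDIC DIGIT TWO) so the code stays ASCII-only:

```python
import re
import unicodedata

arabic_two = '\u0662'  # ARABIC-INDIC DIGIT TWO

print(unicodedata.category(arabic_two))  # Nd ("Number, Decimal Digit")
print(unicodedata.name(arabic_two))      # ARABIC-INDIC DIGIT TWO

# By default \d matches it; with re.ASCII only 0-9 are accepted
unicode_match = re.fullmatch(r'\d', arabic_two)
ascii_match = re.fullmatch(r'\d', arabic_two, re.ASCII)
print(unicode_match is not None)  # True
print(ascii_match is not None)    # False
```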
Hello, first of all, thank you for the tip.
– Ana Carolina Pimenta
I need to apply this rule to an entire column, each row containing a different number. Any suggestions? Thanks.
– Ana Carolina Pimenta
Would this be what you need: https://www.geeksforgeeks.org/replace-values-in-pandas-dataframe-using-regex/ ?
– hkotsubo
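As a rough sketch of applying the same rule to a whole column: the example below treats the column as a plain Python list (the values and names are hypothetical) and applies the substitution to each row. The commented line at the end shows what the equivalent would probably look like with pandas, assuming a DataFrame called df with a column called 'number'.

```python
import re

non_digit = re.compile(r'\D')

# Hypothetical column values; in practice these come from your data
column = ['5080847-62.2018.4.04.7100', '033/2.17.0001000-7', None]

# Apply the substitution row by row, leaving missing values untouched
cleaned = [non_digit.sub('', value) if isinstance(value, str) else value
           for value in column]
print(cleaned)  # ['50808476220184047100', '03321700010007', None]

# Assumption: with a pandas DataFrame `df` and a column named 'number',
# the equivalent would be:
# df['number'] = df['number'].str.replace(r'\D', '', regex=True)
```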