The answer by @Priscilla is sufficient, and in fact the best choice, for the vast majority of cases. However, if your crawler needs to handle monetary values in different formats, it may be useful to take the location/language of the accessed page into account. One way to do this is with Python's locale module.
Here is some illustrative code:
import re
import locale

#--------------------------------------------------
def extractMonetaryValue(text):
    # Currency symbol of the locale currently set (e.g. 'R$' or '$')
    cs = locale.localeconv()['currency_symbol']
    # Symbol, optional spaces, then any run of digits, dots and commas
    expr = '{}[ ]*[0-9.,]+'.format(re.escape(cs))
    m = re.search(expr, text)
    if m:
        s = m.group(0).replace(cs, '').replace(' ', '')
        return locale.atof(s)  # raises ValueError if the format is invalid
    else:
        return 0.0
#--------------------------------------------------

s = 'Este teste testa um valor (por exemplo: R$ 560.200,40) expresso em Reais.'
locale.setlocale(locale.LC_ALL, 'ptb_bra') # 'pt_BR' if not on Windows
n = extractMonetaryValue(s)
print('Para "{}" o valor é: {}'.format(s, n))

s = 'This test tests a value (let us say U$ 482,128.33) given in US Dolars.'
locale.setlocale(locale.LC_ALL, 'enu_usa') # 'en_US' if not on Windows
n = extractMonetaryValue(s)
print('Para "{}" o valor é: {}'.format(s, n))
In this code, the main function is extractMonetaryValue. It receives some text and searches it for a substring that must contain the currency symbol of the configured country/language, followed by zero or more spaces and then a number made up of digits, dots and commas. To do so, it uses a deliberately generic regular expression: it does not care whether the numerical "format" is correct, because that check happens later, in the call to locale.atof (which raises ValueError if the format is incorrect for the configured country/language).
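To make the two stages easier to see in isolation (permissive search first, strict conversion afterwards), here is a minimal sketch that does not depend on which locales are installed. The names find_candidate and parse_number are hypothetical, and the separators are passed in by hand instead of being read from locale.localeconv(); the conversion only imitates what locale.atof does, it is not a replacement for it.

```python
import re

def find_candidate(text, currency_symbol):
    """Stage 1: permissive search. Returns the raw digits/dots/commas
    that follow the currency symbol, or None if the symbol is absent."""
    pattern = re.escape(currency_symbol) + r'[ ]*([0-9.,]+)'
    m = re.search(pattern, text)
    return m.group(1) if m else None

def parse_number(raw, thousands_sep, decimal_point):
    """Stage 2: strict conversion. Drops the thousands separator, turns
    the decimal separator into '.', and lets float() reject bad input
    by raising ValueError."""
    return float(raw.replace(thousands_sep, '').replace(decimal_point, '.'))

raw = find_candidate('por exemplo: R$ 560.200,40) expresso em Reais.', 'R$')
value = parse_number(raw, thousands_sep='.', decimal_point=',')

try:
    parse_number('1,2,3', thousands_sep='.', decimal_point=',')
except ValueError:
    print('malformed number rejected in stage 2')
```

The regular expression happily accepts something like 1,2,3; it is only the conversion step that rejects it, which is the same division of labour the code above relies on.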
The output of the above code is as follows:
Para "Este teste testa um valor (por exemplo: R$ 560.200,40) expresso em Reais." o valor é: 560200.4
Para "This test tests a value (let us say U$ 482,128.33) given in US Dolars." o valor é: 482128.33
Notice how the numbers printed at the end both use the dot as the decimal separator (after all, they are represented internally as float, in the same way regardless of the format of the original text).
P.S.:
- To detect the operating system's default locale, use locale.getdefaultlocale().
- To detect the locale of a web page, check whether it provides this information in the lang attribute. If it doesn't, you'll need to try to infer the language. Luckily for you (wow, hehe), there is a Python port of Google's language detector called langdetect.
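As a rough sketch of the second point, the lang attribute of the html tag can be read with the standard library's html.parser. The names LangAttributeParser, page_language and to_locale_name are hypothetical, and the simple 'pt-BR' to 'pt_BR' mapping is an assumption: the locale names actually accepted by locale.setlocale depend on the operating system (see the Windows names used above).

```python
from html.parser import HTMLParser

class LangAttributeParser(HTMLParser):
    """Remembers the lang attribute of the first <html> tag, if any."""

    def __init__(self):
        super().__init__()
        self.lang = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us
        if tag == 'html' and self.lang is None:
            self.lang = dict(attrs).get('lang')

def page_language(html_text):
    """Returns the declared page language (e.g. 'pt-BR') or None."""
    parser = LangAttributeParser()
    parser.feed(html_text)
    return parser.lang

def to_locale_name(lang):
    """Assumed mapping from a language tag to a POSIX-style locale name."""
    return lang.replace('-', '_') if lang else None

page = '<html lang="pt-BR"><body>R$ 560.200,40</body></html>'
print(to_locale_name(page_language(page)))
```

When page_language returns None, that is the point where a language detector such as langdetect would come in as a fallback.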
It would be good to know about this: http://answall.com/q/44715/101
– Maniero
But do you want the result as a float or as an int? The examples you present are converted to int
– Miguel