Every regular expression, in Python or in any other language, is compiled: it is parsed to check whether the syntax is correct and the expression is valid, to extract all of its tokens, etc. (the details, of course, vary by implementation). In the end, everything is turned into some internal structure containing all the information needed to do the matching.
In the case of Python, a compiled regex results in an instance of re.Pattern.
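A quick way to see this in the interpreter (the expression here is just an example):

import re

prog = re.compile(r'\d+')  # an arbitrary expression, just for the example
print(type(prog))          # in recent versions, prints <class 're.Pattern'>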
According to the documentation, both approaches (calling compile first, or just calling match directly) behave the same. That is, the two forms below work in the same way:
prog = re.compile(pattern)
result = prog.match(string)
# or
result = re.match(pattern, string)
The same goes for the methods search, findall, etc. All of them can either receive the expression as a parameter or be called on the pre-compiled regex.
The difference that the documentation cites is:
but using re.compile() and saving the resulting regular expression object for reuse is more efficient when the expression will be used several times in a single program.
That is, using compile is more efficient if the expression is reused several times. If it is used only once, it will not make a significant difference.
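A typical case is applying the same expression to many strings, for example inside a loop; then it pays off to compile once, beforehand. A minimal sketch (the data here is made up):

import re

linhas = ['id=10', 'id=20', 'no id here', 'id=30']  # made-up data

prog = re.compile(r'id=(\d+)')  # compiled only once, outside the loop
for linha in linhas:
    m = prog.match(linha)
    if m:
        print(m.group(1))  # 10, 20, 30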
However, the same documentation also says the following:
The compiled versions of the most recent patterns passed to re.compile() and the module-level matching functions are cached, so programs that use only a few regular expressions at a time needn’t worry about compiling regular expressions.
That is, the most recent expressions are cached internally, so programs that use few expressions and/or do not reuse them that much should not worry too much about this.
And remember that, in the end, the expression is always compiled; what changes is when that happens: re.match(pattern, string) ends up compiling the expression if it is not already in the cache mentioned above.
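Just to illustrate the idea, a very simplified sketch of what a "compile with cache" looks like (this is not the actual code of the re module, only an approximation, and the names cached_compile and my_match are made up; the key format just mimics what the tests further below show):

import re

_cache = {}

def cached_compile(pattern, flags=0):
    # only compiles on a cache miss; otherwise reuses the stored instance
    key = (type(pattern), pattern, flags)
    if key not in _cache:
        _cache[key] = re.compile(pattern, flags)
    return _cache[key]

def my_match(pattern, string, flags=0):
    # roughly the idea behind re.match: get the compiled pattern (from the cache, if possible) and delegate
    return cached_compile(pattern, flags).match(string)

print(my_match(r'\d+', '123'))  # compiles and caches
print(my_match(r'\d+', '456'))  # reuses the cached pattern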
In this question on SOen (the English Stack Overflow) there are several answers discussing this, and one of them mentions the readability gained by using compile, because it can make it clearer that the expression will be reused several times. I won’t repeat everything that is there, but it is a good source to complement the subject.
As a curiosity, I did a quick test:
import re
texto = "# 44,739 % of all cache refs 12,345 lorem ipsum # 98,736 % etc 45,678 blablbla"
exp = r'# (\d+,\d+) %'
from timeit import timeit
# run each test 1 million times
params = {'number': 1000000, 'globals': globals()}
# using the compiled expression
print(timeit('r.findall(texto)', setup='r = re.compile(exp)', **params))
# not using the compiled expression
print(timeit('re.findall(exp, texto)', **params))
On my machine, on average, the version with compile took between 0.5 and 0.8 seconds, while the other option took between 1.3 and 1.7 seconds. That is, even with the internal cache, using compile still showed a gain. Testing on Ideone.com and on Repl.it, the results were similar (the version with compile was faster).
My guess is that this happens because re.match needs to do the lookup in the cache, so even if the regex is already there, there is still the extra cost of searching for it. With compile, on the other hand, I use the pre-compiled instance directly, with no need to search the cache.
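To get a rough idea of the cost of that lookup by itself, one thing that can be measured (the numbers, of course, vary from machine to machine) is how long re.compile takes when the pattern is already in the cache:

import re
from timeit import timeit

exp = r'# (\d+,\d+) %'
params = {'number': 1000000, 'globals': globals()}

re.compile(exp)  # makes sure the expression is already in the cache
# only the lookup (plus the function call overhead), without doing any matching
print(timeit('re.compile(exp)', **params))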
But as always, what counts in the end is testing your specific case to see if it makes a difference or not.
Playing a little with the cache
Just out of curiosity, I played a bit with the cache (note: I did this in Python 3.7, so in other versions this might not work, since it depends on internal implementation details of the re module, which have changed several times).
Anyway, in Python 3.7 the regex cache is a dictionary, so I first created a subclass of dict to log whenever an element is added to or retrieved from it:
class DictWatch(dict):
    def __init__(self, *args):
        dict.__init__(self, *args)

    def __getitem__(self, key):  # logs every read from the cache
        val = dict.__getitem__(self, key)
        print('getting item from cache:', key)
        return val

    def __setitem__(self, key, val):  # logs every write to the cache
        print(f'storing item in cache: {key}={val}')
        dict.__setitem__(self, key, val)
Then I overwrite the cache and do an initial test, just to see if it works:
import re
re._MAXCACHE = 3  # change the maximum size to 3
re._cache = DictWatch()  # overwrite the cache with my dictionary above

print('\ncompile abc')
re.compile('abc')
print('cache:', re._cache.keys())

print('\ncompile abcd')
re.compile('abcd')
print('cache:', re._cache.keys())

print('\ncompile abc again')
re.compile('abc')
print('cache:', re._cache.keys())

print('\ncompile abcde')
re.compile('abcde')
print('cache:', re._cache.keys())

print('\ncompile abcdef')
re.compile('abcdef')
print('cache:', re._cache.keys())
The output is:
compile abc
storing item in cache: (<class 'str'>, 'abc', 0)=re.compile('abc')
cache: dict_keys([(<class 'str'>, 'abc', 0)])
compile abcd
storing item in cache: (<class 'str'>, 'abcd', 0)=re.compile('abcd')
cache: dict_keys([(<class 'str'>, 'abc', 0), (<class 'str'>, 'abcd', 0)])
compile abc again
getting item from cache: (<class 'str'>, 'abc', 0)
cache: dict_keys([(<class 'str'>, 'abc', 0), (<class 'str'>, 'abcd', 0)])
compile abcde
storing item in cache: (<class 'str'>, 'abcde', 0)=re.compile('abcde')
cache: dict_keys([(<class 'str'>, 'abc', 0), (<class 'str'>, 'abcd', 0), (<class 'str'>, 'abcde', 0)])
compile abcdef
storing item in cache: (<class 'str'>, 'abcdef', 0)=re.compile('abcdef')
cache: dict_keys([(<class 'str'>, 'abcd', 0), (<class 'str'>, 'abcde', 0), (<class 'str'>, 'abcdef', 0)])
Note that when compiling abc for the second time, the regex is retrieved from the cache, because it was already there. And when the maximum size is reached, abc is removed to make room for the last compiled regex.
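Also note that, as the output shows, the cache key is a tuple with the type of the pattern, the pattern itself and the flags. So the same pattern compiled with different flags occupies separate entries. And the real default size of the cache (re._MAXCACHE) is much larger than the 3 used here (512 in the 3.7 implementation, if I'm not mistaken; being an internal detail, it can change). A quick check, in a fresh interpreter:

import re

print(re._MAXCACHE)  # default maximum size of the cache (internal detail, may change)

# the flags are part of the key, so the same pattern with different flags
# generates separate entries in the cache
re.compile('abc')
re.compile('abc', re.IGNORECASE)
print([k for k in re._cache if k[1] == 'abc'])  # two different keys for 'abc'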
Now doing the test with re.match:
import re
re._MAXCACHE = 3  # change the maximum size to 3
re._cache = DictWatch()  # overwrite the cache with my dictionary above

print('match')
re.match('abc', 'xyz')
print('cache:', re._cache.keys())

print('\nmatch again')
re.match('abc', '123')
print('cache:', re._cache.keys())
The output is:
match
storing item in cache: (<class 'str'>, 'abc', 0)=re.compile('abc')
cache: dict_keys([(<class 'str'>, 'abc', 0)])
match again
getting item from cache: (<class 'str'>, 'abc', 0)
cache: dict_keys([(<class 'str'>, 'abc', 0)])
This indicates that the first time re.match is called, the regex is compiled and stored in the cache. The second time it is not compiled again, since it is retrieved from the cache.
Now using compile:
import re
re._MAXCACHE = 3  # change the maximum size to 3
re._cache = DictWatch()  # overwrite the cache with my dictionary above

print('compile')
r = re.compile('abc')
print('cache:', re._cache.keys())

print('\nmatch')
r.match('xyz')
print('cache:', re._cache.keys())

print('\nre.match passing the compiled regex')
re.match(r, '123')
print('cache:', re._cache.keys())

print('\nre.match passing the regex as a string')
re.match('abc', '123')
print('cache:', re._cache.keys())
The output is:
compile
storing item in cache: (<class 'str'>, 'abc', 0)=re.compile('abc')
cache: dict_keys([(<class 'str'>, 'abc', 0)])
match
cache: dict_keys([(<class 'str'>, 'abc', 0)])
re.match passing the compiled regex
cache: dict_keys([(<class 'str'>, 'abc', 0)])
re.match passing the regex as a string
getting item from cache: (<class 'str'>, 'abc', 0)
cache: dict_keys([(<class 'str'>, 'abc', 0)])
Notice how using the pre-compiled instance directly does not search the cache (even when we pass it as a parameter to re.match), while when the expression is passed as a string to re.match, the cache lookup is done.
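And one last detail: if you want to redo these tests from scratch, the re module has the purge function, which clears this internal cache:

import re

re.compile('abc')
print(len(re._cache))  # at least 1 (the pattern we just compiled)

re.purge()             # clears the internal regex cache
print(len(re._cache))  # 0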
Good question, Luiz. But before someone answers, you can satisfy your curiosity here: https://stackoverflow.com/questions/452104/is-it-worth-using-pythons-compile
– Miguel
Hehe, I had seen it, but I decided to bring the question here; I ended up taking the cue from the other one. :D
– Luiz Felipe
You did well, we had no reference to this on SOpt
– Miguel
As I do not have enough knowledge to answer your question completely, I will just leave a comment: it is not exactly the expressions that are compiled, but rather most of the operations and their patterns. Since the subject of regular expressions is closely tied to compilers, it would not be computationally feasible to perform most of these operations at a high level, so, as an optimization, these patterns and operations are compiled into bytecode, down at the C level. Because of that, maybe the process isn't so perfectly tuned. But that's roughly how it happens.
– Mr. Satan
@Joaorobertomendes The expression is compiled, yes, and transformed into an instance of re.Pattern
– hkotsubo
Related: https://softwareengineering.stackexchange.com/a/410008
– hkotsubo