Why use a "compiled" regular expression (re.compile) in Python?

In another question on this site, I noticed that although both answers used regular expressions, they took different paths:

  • One of them used the function re.search to carry out the search operation.
  • The other used the function re.compile to create (as I understand it) a compiled regular expression and, from the returned object, call a method such as search.

Given this, I have some questions:

  • What would be a "compiled" regular expression (as the function name suggests)?
  • What is the advantage of using regular expressions in this way?
  • Is there any downside?
  • Good question, Luiz. But while you wait for an answer, this may satisfy your curiosity: https://stackoverflow.com/questions/452104/is-it-worth-using-pythons-compile

  • Hehe, I've seen it, but I decided to bring it here; I took my cue from the other question. :D

  • You did well, we have no reference to this on SOpt

  • As I don't have enough knowledge to answer your question completely, I'll just comment: it's not exactly the expressions that are compiled, but rather most of the operations and their patterns. Since regular expressions are closely tied to compilers, it would not be computationally feasible to perform most of these operations at a high level, so, as an optimization, these patterns and operations are compiled into bytecode and run at the C level. Because of this, the process may not be perfectly tuned, but that's roughly how it happens.

  • @Joaorobertomendes The expression is compiled, yes, and transformed into an instance of re.Pattern

  • Related: https://softwareengineering.stackexchange.com/a/410008


1 answer

Every regular expression, whether in Python or any other language, is compiled: it is parsed to check that the syntax is correct and the expression is valid, to extract all of its tokens, etc. (the details, of course, vary with the implementation). In the end, everything is transformed into some internal structure containing all the information needed to do the matching.

In the case of Python, a compiled regex results in an instance of re.Pattern.
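As a quick illustration (a minimal sketch; the pattern used is arbitrary), the object returned by re.compile is indeed a re.Pattern:

```python
import re

prog = re.compile(r'\d+')  # parse and compile the expression once

print(type(prog))                    # <class 're.Pattern'>
print(isinstance(prog, re.Pattern))  # True

# the compiled object keeps the original pattern and exposes the matching methods
print(prog.pattern)                  # \d+
```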


According to the documentation, both approaches (compiling first or calling match directly) behave the same. That is, the two forms below are equivalent:

prog = re.compile(pattern)
result = prog.match(string)

# or
result = re.match(pattern, string)

The same goes for the methods search, findall, etc. All of them can receive the expression as a parameter, or can be called on the pre-compiled regex.
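For example (a minimal sketch with an arbitrary string and pattern), the results are identical either way:

```python
import re

texto = 'a1 b22 c333'
prog = re.compile(r'\d+')

# findall: compiled object vs. module-level function give the same result
assert prog.findall(texto) == re.findall(r'\d+', texto) == ['1', '22', '333']

# search: both return an equivalent re.Match for the first occurrence
assert prog.search(texto).group() == re.search(r'\d+', texto).group() == '1'
```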

The difference that the documentation cites is:

but using re.compile() and saving the resulting regular expression object for reuse is more efficient when the expression will be used several times in a single program.

That is, using compile is more efficient if the expression is reused several times. But if it is used only once, it will not make a significant difference.

However, the same documentation also says the following:

The compiled versions of the most recent patterns passed to re.compile() and the module-level matching functions are cached, so programs that use only a few regular expressions at a time needn't worry about compiling regular expressions.

That is, the most recent expressions are cached internally, so programs that use few expressions and/or do not reuse them much needn't worry too much about this.
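As a side note, the size limit and contents of that cache are internal details (re._MAXCACHE and re._cache in CPython, which may change between versions), but there is a public re.purge() function that clears it. A quick sketch, assuming CPython's internal re._cache dictionary is present:

```python
import re

re.purge()                  # start with an empty cache (public API)
re.match(r'\d+', '123')     # the module-level call compiles and caches the pattern

# _cache is an internal CPython detail, used here only for inspection
print(len(re._cache) >= 1)  # True

re.purge()                  # clears the internal cache again
print(len(re._cache))       # 0
```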

And just remember that, in the end, the expression is always compiled; what changes is when that happens: re.match(pattern, string) ends up compiling the expression if it is not already in the cache mentioned above.


In this question on SOen (Stack Overflow in English) there are several answers discussing this, and one of them mentions the readability gained by using compile, because it can make it clearer that the expression will be reused several times. I won't repeat everything there, but it is a good source to supplement the subject.


As a curiosity, I did a quick test:

import re
texto = "# 44,739 % of all cache refs 12,345 lorem ipsum # 98,736 % etc 45,678 blablbla"
exp = r'# (\d+,\d+) %'

from timeit import timeit

# run each test 1,000,000 times
params = { 'number' : 1000000, 'globals': globals() }

# using the compiled expression
print(timeit('r.findall(texto)', setup='r = re.compile(exp)', **params))
# not using the compiled expression
print(timeit('re.findall(exp, texto)', **params))

On my machine, on average, the version with compile took between 0.5 and 0.8 seconds, while the other option took between 1.3 and 1.7 seconds. That is, even with the internal cache, using compile still showed a gain. Testing on Ideone.com and Repl.it, the results were similar (the version with compile was faster).

My guess is that this happens because match needs to do a lookup in the cache, so even if the regex is already there, there is still the additional cost of searching for it. With compile, the pre-compiled instance is used directly, with no need to search the cache.

But as always, what counts in the end is testing your specific case to see if it makes a difference or not.


Playing a little with the cache

Just out of curiosity, I ran a little test with the cache (note: I did this in Python 3.7, so in other versions it might not work, since it depends on internal implementation details of the re module, which have changed several times).

Anyway, in Python 3.7 the regex cache is a dictionary, so I first created a subclass of dict to log whenever an element is added to or retrieved from it:

class DictWatch(dict):
    def __init__(self, *args):
        dict.__init__(self, *args)

    def __getitem__(self, key):
        val = dict.__getitem__(self, key)
        print('obtendo item no cache:', key)
        return val

    def __setitem__(self, key, val):
        print(f'guardando item no cache: {key}={val}')
        dict.__setitem__(self, key, val)

Then I overwrite the cache and run an initial test just to see if it works:

import re
re._MAXCACHE = 3 # change the maximum size to 3
re._cache = DictWatch() # overwrite the cache with my dictionary above

print('\ncompilar abc')
re.compile('abc')
print('cache:', re._cache.keys())

print('\ncompilar abcd')
re.compile('abcd')
print('cache:', re._cache.keys())

print('\ncompilar abc de novo')
re.compile('abc')
print('cache:', re._cache.keys())

print('\ncompilar abcde')
re.compile('abcde')
print('cache:', re._cache.keys())

print('\ncompilar abcdef')
re.compile('abcdef')
print('cache:', re._cache.keys())

The output is:

compilar abc
guardando item no cache: (<class 'str'>, 'abc', 0)=re.compile('abc')
cache: dict_keys([(<class 'str'>, 'abc', 0)])

compilar abcd
guardando item no cache: (<class 'str'>, 'abcd', 0)=re.compile('abcd')
cache: dict_keys([(<class 'str'>, 'abc', 0), (<class 'str'>, 'abcd', 0)])

compilar abc de novo
obtendo item no cache: (<class 'str'>, 'abc', 0)
cache: dict_keys([(<class 'str'>, 'abc', 0), (<class 'str'>, 'abcd', 0)])

compilar abcde
guardando item no cache: (<class 'str'>, 'abcde', 0)=re.compile('abcde')
cache: dict_keys([(<class 'str'>, 'abc', 0), (<class 'str'>, 'abcd', 0), (<class 'str'>, 'abcde', 0)])

compilar abcdef
guardando item no cache: (<class 'str'>, 'abcdef', 0)=re.compile('abcdef')
cache: dict_keys([(<class 'str'>, 'abcd', 0), (<class 'str'>, 'abcde', 0), (<class 'str'>, 'abcdef', 0)])

Note that when abc is compiled the second time, the regex is retrieved from the cache because it was already there. And when the maximum size is reached, abc is removed to make room for the last compiled regex.

Now doing the test with re.match:

import re
re._MAXCACHE = 3 # change the maximum size to 3
re._cache = DictWatch() # overwrite the cache with my dictionary above

print('match')
re.match('abc', 'xyz')
print('cache:', re._cache.keys())

print('\nmatch de novo')
re.match('abc', '123')
print('cache:', re._cache.keys())

The output is:

match
guardando item no cache: (<class 'str'>, 'abc', 0)=re.compile('abc')
cache: dict_keys([(<class 'str'>, 'abc', 0)])

match de novo
obtendo item no cache: (<class 'str'>, 'abc', 0)
cache: dict_keys([(<class 'str'>, 'abc', 0)])

This indicates that the first time re.match is called, the regex is compiled and stored in the cache. The second time, it is not compiled again, as it is retrieved from the cache.

Now using compile:

import re
re._MAXCACHE = 3 # change the maximum size to 3
re._cache = DictWatch() # overwrite the cache with my dictionary above

print('compile')
r = re.compile('abc')
print('cache:', re._cache.keys())

print('\nmatch')
r.match('xyz')
print('cache:', re._cache.keys())

print('\nre.match passando a regex compilada')
re.match(r, '123')
print('cache:', re._cache.keys())

print('\nre.match passando a regex como string')
re.match('abc', '123')
print('cache:', re._cache.keys())

The output is:

compile
guardando item no cache: (<class 'str'>, 'abc', 0)=re.compile('abc')
cache: dict_keys([(<class 'str'>, 'abc', 0)])

match
cache: dict_keys([(<class 'str'>, 'abc', 0)])

re.match passando a regex compilada
cache: dict_keys([(<class 'str'>, 'abc', 0)])

re.match passando a regex como string
obtendo item no cache: (<class 'str'>, 'abc', 0)
cache: dict_keys([(<class 'str'>, 'abc', 0)])

Notice how using the pre-compiled instance directly does not search the cache (even when we pass it as a parameter to re.match), while passing the expression as a string to re.match does trigger the cache lookup.
