If the file always follows this structure (assuming, for example, that one "config" block never appears inside another), you can simply read the file line by line.
When you find the line containing config system interface, you record that a configuration block has started. From there, keep saving every subsequent line until you find a line that contains only end:
configuracoes = []  # holds every 'config system interface' block found
with open('arquivo_de_configuracao.txt') as file:
    dentro_do_bloco = False  # whether we are inside a block we want
    config = []  # holds the lines of the current block
    for linha in file:  # read the file line by line
        linha = linha.strip('\n')  # remove the line break
        if linha == 'config system interface':
            dentro_do_bloco = True  # the block has started
            config.append(linha)
        elif dentro_do_bloco:
            config.append(linha)
            if linha == 'end':
                dentro_do_bloco = False  # the block has ended
                # join the lines and store the block in the list of blocks found
                configuracoes.append('\n'.join(config))
                config = []
# print the blocks found
for c in configuracoes:
    print(c)
This solution assumes that there are no nested configuration blocks (and therefore no end line inside the block itself, since the first end is what marks the current block as finished).
Also note the use of with: it ensures that the file is closed even if an error occurs during execution.
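As a minimal sketch, the same loop can be wrapped in a generator so that each block is yielded as soon as it is complete, instead of being accumulated in a list. The function name iter_blocks, its parameters, and the sample text are assumptions made for illustration; they are not part of the original code.

```python
import io

def iter_blocks(lines, start='config system interface', end='end'):
    """Yield each block from `start` to `end` as a single string.

    Hypothetical helper: assumes blocks are not nested and that the
    marker lines contain nothing but the marker text.
    """
    block = None  # None means we are currently outside a block
    for line in lines:
        line = line.rstrip('\n')  # remove the line break
        if block is None:
            if line == start:
                block = [line]  # the block has started
        else:
            block.append(line)
            if line == end:
                yield '\n'.join(block)  # the block has ended
                block = None

# Made-up sample data standing in for the real file
sample = io.StringIO(
    'config system global\n'
    'set hostname fw\n'
    'end\n'
    'config system interface\n'
    'edit port1\n'
    'end\n'
)
blocks = list(iter_blocks(sample))
for b in blocks:
    print(b)
```

Because an open file is also an iterable of lines, the same function could be called with the file object inside a with block.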
Regex
I consider the solution above the simplest option, but since you used the regex tag in the question, here is an alternative using the re module:
import re
with open('/tmp/arq.txt') as file:
    conteudo = file.read()
r = re.compile('^config system interface$(?:(?!^end$).)+^end$', re.MULTILINE | re.DOTALL)
configuracoes = r.findall(conteudo)
# print the blocks found
for c in configuracoes:
    print(c)
Although it has fewer lines than the previous solution, it is not necessarily simpler and/or more efficient¹.
First, this solution uses the read method, which loads the entire contents of the file into memory (unlike the previous solution, which reads one line at a time). For small files it won’t make much difference, but with large files, reading everything at once can consume a lot of memory.
As for the regex, it uses the markers ^ and $, which match the beginning and end of the string, respectively. Thanks to the MULTILINE flag, they also match the beginning and end of each line.
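A small illustration of the effect of this flag, using a made-up three-line string:

```python
import re

text = 'first\nend\nlast'  # made-up sample with "end" on the middle line

# Without MULTILINE, ^ and $ anchor only at the start and end of the whole string
without_flag = re.findall('^end$', text)

# With MULTILINE, they also anchor at the start and end of every line
with_flag = re.findall('^end$', text, re.MULTILINE)

print(without_flag)
print(with_flag)
```

Only the second call finds the middle line, because without the flag "end" is neither at the start nor at the end of the string.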
Hence the excerpt ^config system interface$ matches a line that contains exactly "config system interface" (not one character more or less). The same goes for ^end$.
In between we have (?:(?!^end$).)+. Explaining it from the inside out:
- (?!^end$) is a negative lookahead, which checks that something does not exist ahead. In this case, it checks that there is no ^end$ at the current position, that is, that the line starting there is not "end". This guarantees that we have not reached the end of a configuration block.
- The dot (.), by default, matches any character except line breaks. Thanks to the DOTALL flag, it also matches line breaks.
So the negative lookahead combined with . means "any character, as long as a line that is exactly end does not start at that position". That is, anything inside the block that we want.
All of this is wrapped in parentheses and followed by the quantifier +, meaning "one or more occurrences". In other words, the regex consumes several characters, as long as it does not reach a line that is exactly "end".
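To see this middle part in isolation, here is a short sketch that applies only (?:(?!^end$).)+ to a made-up string. The variable names and the sample text are assumptions for illustration.

```python
import re

text = 'edit port1\nset mode static\nend\nedit port2'  # made-up sample

# Consume characters one by one, stopping right before a line that is exactly "end"
body = re.compile('(?:(?!^end$).)+', re.MULTILINE | re.DOTALL)
m = body.match(text)
print(repr(m.group()))
```

The match covers the first two lines, including their line breaks, and stops at the start of the "end" line without consuming it.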
The parentheses use the syntax (?:, which forms a non-capturing group. If I used plain parentheses, without the ?:, they would form a capturing group. This detail matters because the findall method returns only the capturing groups when any are present. Since I want the whole match, I use a non-capturing group, so that findall returns the entire configuration block.
The findall method returns a list with all the configuration blocks found. Just to make the difference clearer, suppose I use a capturing group (placing parentheses around this excerpt):
r = re.compile('^config system interface$((?:(?!^end$).)+)^end$', re.MULTILINE | re.DOTALL)
                                         ^               ^
In that case, findall returns only the part matched by the group, that is, the contents of the configuration block without the first line ("config system interface") and the last line ("end").
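The difference can be checked with a short comparison of the two patterns on a made-up block (the sample text and variable names are illustrative assumptions):

```python
import re

text = (
    'config system interface\n'
    'edit port1\n'
    'end\n'
)  # made-up sample
flags = re.MULTILINE | re.DOTALL

# Non-capturing group only: findall returns the whole match
whole = re.findall('^config system interface$(?:(?!^end$).)+^end$', text, flags)

# Extra capturing group: findall returns only what the group matched
inner = re.findall('^config system interface$((?:(?!^end$).)+)^end$', text, flags)

print(whole)
print(inner)
```

Note that the captured part still includes the line breaks right after the first line and right before "end", since those characters are consumed inside the group.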
Although the regex works, I don’t think it is the most appropriate solution for this case. Besides loading the entire file into memory (which may or may not matter, depending on the size of the files you read), the regex is not very efficient: at every position the lookahead checks what is ahead and then comes back, over and over, to make sure we are not crossing the line that contains "end" (see an example of it working).
Another alternative is ^config system interface$.*?^end$, which is a little more efficient than the lookahead (see here and compare the number of steps with the previous regex). Now I simply use .* (zero or more characters), and the ? after it makes the quantifier "lazy", so it takes as few characters as necessary (that way, it only goes up to the next line that is "end"). Without the ?, the quantifier is greedy: it takes as many characters as possible and goes all the way to the last end, which may include lines from other configuration blocks.
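A brief sketch of the lazy and greedy variants side by side, on made-up input with two blocks (the second block and its contents are invented for illustration):

```python
import re

text = (
    'config system interface\n'
    'edit port1\n'
    'end\n'
    'config system dns\n'
    'set primary 192.0.2.1\n'
    'end\n'
)  # made-up sample with a second, unrelated block
flags = re.MULTILINE | re.DOTALL

# Lazy quantifier: stops at the first line that is exactly "end"
lazy = re.findall('^config system interface$.*?^end$', text, flags)

# Greedy quantifier: runs to the last "end" line, swallowing the other block
greedy = re.findall('^config system interface$.*^end$', text, flags)

print(lazy)
print(greedy)
```

With the lazy version only the interface block is returned, while the greedy version returns everything up to the final "end", including the unrelated block.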
(1): Of course, for small files it won’t make much difference whether you use a regex or read line by line, but depending on the files being read, keep in mind that the chosen solution can make a difference. It is worth testing each case yourself.
There are several ways to do this, but with this "strategy" any of them would be slower and more confusing than if you used a format better suited to configuration (JSON, INI, YAML, etc.). In Python I would use YAML; a good reference is that link
– Sidon
By the way, looking closely, this file is already almost in YAML format, so with a few adjustments... If you cannot change the file, then you will have to parse it according to the given pattern.
– Sidon
Thanks for the help..
– Bruno