How to recover a specific chunk of text

Question

Asked 6 years, 1 month ago

I have a configuration file and need to extract a specific chunk of it. What would be the best way to do this?

In the example below I need to extract the section that starts at "config system interface" and ends at the next "end".

config system accprofile
    edit "prof_admin"
        set mntgrp read-write
        set admingrp read-write
        set updategrp read-write
        set authgrp read-write
        set sysgrp read-write
        set netgrp read-write
        set loggrp read-write
        set routegrp read-write
        set fwgrp read-write
        set vpngrp read-write
        set utmgrp read-write
        set wanoptgrp read-write
        set endpoint-control-grp read-write
        set wifi read-write
    next
end
config system interface
    edit "port1"
        set vdom "root"
        set ip 192.168.0.150 255.255.255.0
        set allowaccess ping https ssh http telnet
        set type physical
        set role wan
        set snmp-index 1
    next
    edit "port2"
        set vdom "root"
        set type physical
        set role wan
        set snmp-index 2
    next
end

There are several ways to do this, but with this "strategy" any of them would be slower and more "confusing" than using a format better suited to configuration (JSON, INI, YAML, etc.). In Python I would use YAML; a good reference is this link

– Sidon

2019/06/23 at 15:26
By the way, looking closely, this file is already almost in YAML format; it would only need a few adjustments. If you cannot change the file, then you will have to parse it according to the given pattern.

– Sidon

2019/06/23 at 15:35
Thanks for the help..

– Bruno

2019/06/24 at 20:43

2 answers

If the file always follows this structure (assuming, for example, that there will not be a "config" block within another), you can simply read the file line by line.

When you find the line containing config system interface, you mark that a configuration block has started. From there, keep saving all the subsequent lines until you find a line that contains only end:

configuracoes = []  # holds every 'config system interface' block found

with open('arquivo_de_configuracao.txt') as file:
    dentro_do_bloco = False  # whether we are inside a wanted config block
    config = []  # holds the lines of the current block

    for linha in file:  # read the file line by line
        linha = linha.strip('\n')  # remove the line break
        if linha == 'config system interface':
            dentro_do_bloco = True  # the block has started
            config.append(linha)
        elif dentro_do_bloco:
            config.append(linha)
            if linha == 'end':
                dentro_do_bloco = False  # the block has ended
                # join the lines and store them in the list of blocks found
                configuracoes.append('\n'.join(config))
                config = []

# print the blocks found
for c in configuracoes:
    print(c)

This solution assumes that there are no nested configuration blocks (and therefore no end within the block itself, as it determines that the current config has ended).
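If nested blocks can occur, one possible extension is to track the nesting depth instead of a single flag. The sketch below is only an illustration under stated assumptions: it assumes every nested block opens with a line starting with "config " and closes with a line that is exactly "end" (ignoring indentation). The function name extract_blocks and the sample text are hypothetical.

```python
def extract_blocks(lines, header):
    """Yield each block that starts at `header` and ends at its matching 'end'."""
    depth = 0   # 0 means we are outside the wanted block
    block = []
    for raw in lines:
        line = raw.rstrip('\n')
        stripped = line.strip()
        if depth == 0:
            if stripped == header:
                depth = 1
                block = [line]
            continue
        block.append(line)
        if stripped.startswith('config '):
            depth += 1       # a nested block opens
        elif stripped == 'end':
            depth -= 1       # a block closes
            if depth == 0:   # the outermost block is complete
                yield '\n'.join(block)
                block = []


# Hypothetical sample with one nested block inside the wanted section
sample = """config system accprofile
    edit "prof_admin"
    next
end
config system interface
    edit "port1"
        config ipv6
            set ip6-mode static
        end
    next
end
""".splitlines()

blocks = list(extract_blocks(sample, 'config system interface'))
print(blocks[0])
```

Because the function only iterates over lines, it can also receive an open file object directly, so the file is still read one line at a time.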

Also note the use of with: this ensures that the file will be closed, even if an error occurs during execution.

Regex

I consider the solution above the simplest option, but since you used the regex tag on the question, here is an alternative using the re module:

import re

with open('/tmp/arq.txt') as file:
    conteudo = file.read()
    r = re.compile('^config system interface$(?:(?!^end$).)+^end$', re.MULTILINE | re.DOTALL)
    configuracoes = r.findall(conteudo)

# print the blocks found
for c in configuracoes:
    print(c)

Although it has fewer lines than the previous solution, it is not necessarily simpler and/or more efficient¹.

First, this solution uses the read method, which loads the entire contents of the file into memory (unlike the previous solution, which reads one line at a time). For small files it won’t make much difference, but if you are processing large files, reading everything at once may lead to high memory consumption.

As for the regex, it uses the markers ^ and $, which are respectively the beginning and end of the string. But thanks to the MULTILINE flag, they also match the beginning and end of each line.

Hence the excerpt ^config system interface$ matches a line that contains exactly "config system interface" (not one character more or less). The same goes for ^end$.
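The effect of the MULTILINE flag on ^ and $ can be seen in a small example. The sample text here is made up for illustration; note that the line containing "backend" is not matched, since the line must be exactly "end".

```python
import re

# Hypothetical sample: one line merely contains "end", another is exactly "end"
text = "config system interface\n    set alias backend\nend\n"

# Without MULTILINE, ^ only anchors at the start of the whole string
without_flag = re.findall('^end$', text)

# With MULTILINE, ^ and $ also anchor at the start and end of each line
with_flag = re.findall('^end$', text, re.MULTILINE)

print(without_flag)  # []
print(with_flag)     # ['end']
```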

In between we have (?:(?!^end$).)+. Explaining from the inside out:

(?!^end$) is a negative lookahead, which checks that something does not exist ahead. In this case, it checks that there is no ^end$ ahead, that is, that the line is not "end". With this, we guarantee that we are not at the end of a configuration block.
the dot, by default, matches any character except line breaks. But thanks to the DOTALL flag, it also matches line breaks.

So the negative lookahead combined with . means "any character, as long as it is not the start of a line that is exactly end". That is, anything inside the block that we want.

All this is in parentheses, followed by the quantifier +, meaning "one or more occurrences". In other words, I match several characters in a row, as long as I do not reach a line that is exactly "end".

The parentheses use the (?: syntax, which forms a non-capturing group. With plain parentheses, without the ?:, they would form a capturing group, and this detail matters because findall returns only the capturing groups when any are present. Since I want the whole chunk, I use a non-capturing group, so that findall returns each complete configuration block.

The method findall returns a list of all captured configuration snippets. Just to better understand the difference, if I use a capture group (placing parentheses around this excerpt):

r = re.compile('^config system interface$((?:(?!^end$).)+)^end$', re.MULTILINE | re.DOTALL)
                                         ^               ^

The method findall returns only the corresponding chunk, ie the whole configuration block, except for the beginning ("config system interface") and the end ("end").
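The difference between the two variants can be checked with a short, self-contained comparison. The sample text is hypothetical and much smaller than a real configuration file; the two patterns are the ones discussed above.

```python
import re

# Hypothetical sample with the wanted block followed by another one
text = "config system interface\n    set role wan\nend\nconfig other\nend\n"
flags = re.MULTILINE | re.DOTALL

# Non-capturing group only: findall returns the whole match
whole = re.findall('^config system interface$(?:(?!^end$).)+^end$', text, flags)

# Extra capturing group: findall returns only what the group matched
inner = re.findall('^config system interface$((?:(?!^end$).)+)^end$', text, flags)

print(repr(whole[0]))  # 'config system interface\n    set role wan\nend'
print(repr(inner[0]))  # '\n    set role wan\n'
```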

Although the regex works, I don’t think it’s the most appropriate solution for this case. Besides loading the entire file into memory (which may or may not matter, depending on the size of the files you read), the regex is not very efficient: the lookahead makes it check what is ahead and then come back, at every single character, to make sure we are not crossing a line that is exactly "end" (see an example of it working).

Another alternative is to use ^config system interface$.*?^end$, which is a little more efficient than the lookahead (see here and compare the number of steps with the previous regex). Now I simply use .* (zero or more characters), and the ? that follows makes the quantifier "lazy", so it takes as few characters as necessary (that is, it only goes up to the next line that is exactly "end"). Without the ?, the quantifier takes as many characters as possible and goes up to the last end in the file, which may include lines from other configuration blocks.
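A quick way to see the lazy and greedy behaviours side by side is to run both patterns on a text with two blocks. The second block ("config system dns") is a made-up sample used only for this illustration.

```python
import re

# Hypothetical sample: the wanted block followed by an unrelated one
text = (
    "config system interface\n"
    "    set role wan\n"
    "end\n"
    "config system dns\n"
    "    set primary 8.8.8.8\n"
    "end\n"
)
flags = re.MULTILINE | re.DOTALL

# Lazy quantifier: stops at the first line that is exactly "end"
lazy = re.findall('^config system interface$.*?^end$', text, flags)

# Greedy quantifier: runs up to the last "end" line in the text
greedy = re.findall('^config system interface$.*^end$', text, flags)

print('config system dns' in lazy[0])    # False
print('config system dns' in greedy[0])  # True
```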

(1): Of course, for small files it won’t make so much difference to use regex or read line by line, but depending on the files being read, it is important to keep in mind that there may be differences according to the chosen solution. It’s up to you to test every case.
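One way to run such a test is with the standard timeit module. The sketch below is only a rough starting point: the input is a hypothetical text built by repeating one small block, the function line_by_line is a condensed version of the first solution working on a string instead of a file, and the timings will vary by machine and by the real files involved.

```python
import re
import timeit

# Hypothetical input: the same small block repeated many times
block = "config system interface\n    set role wan\nend\n"
text = block * 1000

flags = re.MULTILINE | re.DOTALL
lookahead = re.compile('^config system interface$(?:(?!^end$).)+^end$', flags)
lazy = re.compile('^config system interface$.*?^end$', flags)


def line_by_line(content):
    """Condensed line-by-line approach, applied to a string."""
    found, current, inside = [], [], False
    for line in content.split('\n'):
        if line == 'config system interface':
            inside = True
            current = [line]
        elif inside:
            current.append(line)
            if line == 'end':
                inside = False
                found.append('\n'.join(current))
    return found


# All three approaches should find the same blocks
assert lookahead.findall(text) == lazy.findall(text) == line_by_line(text)

candidates = [
    ('lookahead', lookahead.findall),
    ('lazy', lazy.findall),
    ('line by line', line_by_line),
]
for name, func in candidates:
    seconds = timeit.timeit(lambda: func(text), number=20)
    print(f'{name}: {seconds:.4f}s')
```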

Both alternatives run perfectly, but it is important to think about the performance of the program. Thanks for the attention and for explaining everything step by step. Much appreciated.

– Bruno

2019/06/24 at 20:56

by Éder Garcia • **785** points · Answer 1 · 2019-06-23T18:51:11+00:00

To test my answer, save your configuration file as config.txt and save the code that follows as extrair_trecho.py

In the variables inicio and fim you put the strings that mark the beginning and the end of the chunk you want to extract from the file config.txt.

In the first for we go through all the lines of arquivo.

If inicio is in linha, then copiar is set to True and lines are added to the list trecho until the closing line fim is found.

In the last for, all lines in the list trecho are printed on the screen.

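arquivo = "config.txt"
inicio = "config system interface"
fim = "end"
copiar = False
trecho = []

# 'with' closes the file even if an error occurs
with open(arquivo, "r") as arq:
    for linha in arq:
        if copiar:
            trecho.append(linha)

        if inicio in linha:
            copiar = True
            trecho.append(linha)

        # compare the whole line, so that lines merely containing "end"
        # (such as "set endpoint-control-grp") do not stop the copy
        if linha.strip() == fim:
            copiar = False

# each linha still ends with '\n', so print() leaves a blank line between lines
for linha in trecho:
    print(linha)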

Output:

config system interface

    edit "port1"

        set vdom "root"

        set ip 192.168.0.150 255.255.255.0

        set allowaccess ping https ssh http telnet

        set type physical

        set role wan

        set snmp-index 1

    next

    edit "port2"

        set vdom "root"

        set type physical

        set role wan

        set snmp-index 2

    next

end