Reading and processing XML file from a CVE (Common Vulnerabilities and Exposures) database with Python

Using the attached files lista.txt and cve.xml, develop a Python script that:

opens and reads the ".xml" file provided at the end;
searches the file for all occurrences of a CVE ID in the format "CVE-YYYY-NNNN", where YYYY is the 4-digit year and NNNN is the CVE number, which can have 4, 5, 6 or 7 digits;
confirms that the CVEs recovered from the file exist in a locally stored list (the "lista.txt" file). If a CVE is not in lista.txt, the script should append it to the end of the list, writing one "CVE-YYYY-NNNN" entry per line; and
prints "Novo CVE encontrado: CVE-xxxx-yyyy" on screen for each new CVE saved in "lista.txt". If no new CVE is found in the ".xml" file, it prints "Sem CVE Novo" on screen.

lista.txt: https://drive.google.com/open?id=1RBbcrXQkfymjkXC-sP_nuiz9S3qO15dZ

cve.xml: https://drive.google.com/open?id=10l4oC9rmbdz4CbXwDukMk1gG5yANFdcY

I could only open the XML file:

import xml.etree.ElementTree as ET

tree = ET.parse('cve.xml')

Could someone help?

I think it can also be solved using only regex. If anyone can show both solutions, it would help my learning a lot!

2 answers

First you load the CVE list from the file lista.txt:

# load the CVE list
with open('lista.txt') as arq:
    # strip the line breaks from the end of each line
    cve_list = [linha.rstrip() for linha in arq]

Note the use of with to ensure that the file is closed at the end.

Then you iterate over the XML and, for each CVE, check whether it is already in the list. If it is not, add it to a list of new CVEs found:

import xml.etree.ElementTree as ET

novos = []  # holds the new CVEs found

tree = ET.parse('cve.xml')
root = tree.getroot()  # root (ExploitPackList)
for canvas in root:  # for each CANVASExploitPack
    exploits = canvas[0]  # get the Exploits tag
    for exploit in exploits:  # for each Exploit
        cve = exploit.attrib['cve']
        if cve not in cve_list and cve not in novos:
            print('Novo CVE encontrado:', cve)
            novos.append(cve)

Here I am assuming that the structure is exactly the one in the file:

<ExploitPackList>
    <CANVASExploitPack date="Fri Jul  5 11:03:08 2013" name="White_Phosphorus">
        <Exploits>
            several <Exploit> tags containing the CVE...
        </Exploits>
    </CANVASExploitPack>
    <CANVASExploitPack date="Fri Jul  5 11:03:08 2013" name="CANVAS">
        <Exploits>
            <Exploit cve="CVE-2019-5056" desc="Open-Realty &lt;= 2.4.3 Remote Code Execution" name="openrealty_exec"/>
            several <Exploit> tags containing the CVE...

That is, <ExploitPackList> may contain several <CANVASExploitPack> elements, each of which has exactly one <Exploits> element containing multiple <Exploit> tags.
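If the layout turns out to be less regular than assumed here, a minimal sketch like the following avoids depending on element positions: Element.iter('Exploit') visits every <Exploit> at any depth, and .get('cve') returns None instead of raising when the attribute is missing. The sample XML and the name extract_cves are illustrative assumptions, not part of the original file.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample that mirrors the structure described above.
SAMPLE = """<ExploitPackList>
    <CANVASExploitPack name="White_Phosphorus">
        <Exploits>
            <Exploit cve="CVE-2019-5056" name="openrealty_exec"/>
            <Exploit name="no_cve_attribute"/>
        </Exploits>
    </CANVASExploitPack>
    <CANVASExploitPack name="CANVAS">
        <Exploits>
            <Exploit cve="CVE-2016-0915" name="d2sec_lotuscal2"/>
        </Exploits>
    </CANVASExploitPack>
</ExploitPackList>"""


def extract_cves(root):
    """Return the cve attribute of every <Exploit>, at any depth."""
    found = []
    for exploit in root.iter('Exploit'):
        cve = exploit.get('cve')  # None when the attribute is missing
        if cve is not None:
            found.append(cve)
    return found


sample_cves = extract_cves(ET.fromstring(SAMPLE))
print(sample_cves)
```

The trade-off is that this accepts <Exploit> tags anywhere in the document, which may or may not be what you want if the structure is meant to be strict.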

I also check that the CVE is neither in the list loaded from lista.txt nor in the list of new CVEs found (I do not know whether the file contains repeats, so checking the list of new ones may be redundant).
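For larger files, the same double check can be sketched with a set, which makes each membership test cheap while still keeping the new CVEs in first-seen order. The function name find_new and the sample values are assumptions made for illustration.

```python
def find_new(found, known):
    """Return the items of `found` that are not in `known`,
    without repeats, in the order they first appear."""
    seen = set(known)
    new = []
    for cve in found:
        if cve not in seen:
            seen.add(cve)  # also guards against repeats inside `found`
            new.append(cve)
    return new


# Hypothetical data: one known CVE, one repeated new CVE, one more new CVE.
known = ['CVE-2019-5056']
found = ['CVE-2019-5056', 'CVE-2016-0915', 'CVE-2016-0915', 'CVE-2020-2240']
new_cves = find_new(found, known)
print(new_cves)
```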

Then you take the list of new CVEs and append them to the end of the file (or print the message saying that none were found, if the list is empty):

if novos:  # the list of new CVEs is not empty: https://docs.python.org/3/library/stdtypes.html#truth-value-testing
    with open('lista.txt', 'a') as arq:  # mode "a" appends content to the end of the file
        for cve in novos:
            arq.write(f'\n{cve}')
else:
    print('Sem CVE Novo')

Of course you could check with if len(novos) > 0:, but since an empty list is considered False, I can simply write if novos to find out whether the list has any elements.

I saw that lista.txt does not end with a line break, so I put the \n before each CVE. That way the last CVE has no line break after it, and successive runs of the program keep the file at one CVE per line.
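If you cannot be sure whether the file ends with a line break (for example, after someone edits it by hand), one possible sketch is to inspect the last byte before appending. Note that this variant follows a different convention from the code above: it leaves a line break after the last entry. The helper name append_lines and the temporary file are assumptions for the demonstration.

```python
import os
import tempfile


def append_lines(path, items):
    """Append one item per line, whether or not the file
    currently ends with a line break."""
    needs_newline = False
    if os.path.exists(path) and os.path.getsize(path) > 0:
        with open(path, 'rb') as f:
            f.seek(-1, os.SEEK_END)  # look at the last byte only
            needs_newline = f.read(1) != b'\n'
    with open(path, 'a') as f:
        if needs_newline:
            f.write('\n')
        for item in items:
            f.write(f'{item}\n')


# Demonstration on a temporary file that, like lista.txt, lacks a final line break.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'lista.txt')
    with open(path, 'w') as f:
        f.write('CVE-2019-5056')
    append_lines(path, ['CVE-2016-0915'])
    append_lines(path, ['CVE-2020-2240'])
    with open(path) as f:
        lines_after = f.read().splitlines()

print(lines_after)
```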

One detail is that the XML contains several CVEs that do not start with "CVE-". If you want to validate the format as well, you can use a regex:

import xml.etree.ElementTree as ET
import re

r = re.compile(r'^CVE-\d{4}-\d{4,7}$')
tree = ET.parse('cve.xml')
root = tree.getroot()
for canvas in root:
    exploits = canvas[0]
    for exploit in exploits:
        cve = exploit.attrib['cve']
        if r.match(cve) and cve not in cve_list and cve not in novos:
            print('Novo CVE encontrado:', cve)
            novos.append(cve)

In this case, the anchors ^ and $ mark the beginning and end of the string respectively, ensuring that it contains only what the regex specifies.

The shortcut \d matches a digit, and the quantifiers {4} and {4,7} mean "exactly 4" and "at least 4, at most 7", respectively.
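A quick way to see the anchors and quantifiers at work is to run the pattern against a few hand-made strings (the samples are made up for illustration). The second part shows one subtlety: $ also matches just before a trailing line break, so re.fullmatch is the stricter choice if the input might carry one.

```python
import re

anchored = re.compile(r'^CVE-\d{4}-\d{4,7}$')

# Hypothetical samples: valid, too few digits, too many digits, wrong prefix.
samples = ['CVE-2019-5056', 'CVE-2019-123', 'CVE-2019-12345678', 'CAN-2019-5056']
valid = [s for s in samples if anchored.match(s)]
print(valid)

# $ tolerates a single trailing line break; fullmatch does not.
with_newline = 'CVE-2019-5056\n'
loose = bool(anchored.match(with_newline))
strict = bool(re.fullmatch(r'CVE-\d{4}-\d{4,7}', with_newline))
print(loose, strict)
```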

I would use regex only for this validation. I know many people think they could also use it to extract the data from the XML, with something like:

r = re.compile(r'<Exploit cve="(CVE-\d{4}-\d{4,7})"')
with open('cve.xml') as arq:
    for linha in arq:
        m = r.search(linha)
        if m:  # found a CVE in the expected format
            cve = m.group(1)
            # check whether it is in the list, etc...

For simple cases it may even work, but regex is not the right tool for this task. A small change to the XML is enough to break it. For example, what if there is a commented-out snippet:

    <!--
    <Exploit cve="CVE-2020-2240" desc="IBM Lotus Domino Web Server Accept-Language HTTP Header Buffer Overflow Vulnerability" name="d2sec_lotus_domino_http"/>
    <Exploit cve="CVE-2016-0915" desc="IBM Lotus Domino iCalendar Meeting Request Stack Overflow Vulnerability" name="d2sec_lotuscal2"/>
    <Exploit cve="CVE-2019-4467" desc="Oracle JInitiator ActiveX Buffer Overflow" name="d2sec_jinitiator"/>
-->

xml.etree.ElementTree correctly detects and ignores the tags above, but the regex does not. This is because the parser analyzes the context each tag appears in, whereas the regex only evaluates the fragment we specify (<Exploit cve="etc...).
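The difference is easy to reproduce on a small sample. This is only a sketch with a made-up XML fragment: the regex reports the commented-out CVE, while the parser, which drops comments by default, does not.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical fragment: one commented-out <Exploit> and one active one.
SAMPLE = """<Exploits>
    <!--
    <Exploit cve="CVE-2020-2240" name="d2sec_lotus_domino_http"/>
    -->
    <Exploit cve="CVE-2016-0915" name="d2sec_lotuscal2"/>
</Exploits>"""

# The regex sees only text, so it also matches inside the comment.
via_regex = re.findall(r'<Exploit cve="(CVE-\d{4}-\d{4,7})"', SAMPLE)

# The parser understands the comment and skips its contents.
via_parser = [e.get('cve') for e in ET.fromstring(SAMPLE).iter('Exploit')]

print('regex: ', via_regex)
print('parser:', via_parser)
```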

It is even possible to write a regex that checks whether the fragment is inside a comment, but is it worth it, when with the parser the code stays the same (and is much simpler)?

And this is just one case: there are several other situations that would require changing the regex, making it more and more complicated (see some examples here and here; although these links deal with HTML, the same explanations apply to XML).

Regex is cool and I like it a lot, but it is not always the best solution.

by Paz • **3,062** points · Answer 1 · 2020-03-23T15:01:24+00:00

The regex you should use is:

CVE-\d{4}-\d{4,7}

How it works:

It searches for the sequence "CVE-".
If that succeeds, it checks that 4 digits follow: "\d{4}".
If that succeeds, it checks that the sequence "-" follows.
If that succeeds, it captures 4 to 7 digits, preferring to capture as many digits as possible: "\d{4,7}".
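One consequence of the steps above is that, without anchors, a number with more than 7 digits still matches on its first 7 digits. If that matters, a possible refinement (my assumption, not part of the task statement) is a word boundary before "CVE" and a negative lookahead for a further digit. The sample text is made up.

```python
import re

# Hypothetical text with two valid IDs and one with an 8-digit number.
text = 'Fixed CVE-2019-5056 and CVE-2016-10915; ignore CVE-2019-12345678.'

# Plain pattern: the 8-digit number is matched on its first 7 digits.
plain = re.findall(r'CVE-\d{4}-\d{4,7}', text)

# Bounded pattern: rejects a match that is immediately followed by another digit.
bounded = re.findall(r'\bCVE-\d{4}-\d{4,7}(?!\d)', text)

print(plain)
print(bounded)
```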

So your code should look like this; I included some comments explaining the process:

import lxml.etree as et
import re

# Variables used to track whether new CVEs have been found

matchCounter = 0
newCveList = []

# Part 1
# Open the XML and convert it to a string so that regex can be used on it
xml = et.parse('./cve.xml')
contentXmlOnBytes = et.tostring(xml, pretty_print=False)
contentXmlOnString = contentXmlOnBytes.decode("utf-8")

# Part 2
# Collect every CVE from the XML with regex, and the ones from the txt file
allXmlCve = re.findall(r'CVE-\d{4}-\d{4,7}', contentXmlOnString)

with open('./lista.txt') as arq:
    txtCveList = [linha.rstrip() for linha in arq]

# Parts 3 and 4
# Compare the two lists and use the control
# variables to check whether there were new CVEs
# and print the results
for cve in allXmlCve:
    if cve not in txtCveList and cve not in newCveList:
        print('Novo CVE encontrado:', cve)
        newCveList.append(cve)
        matchCounter = matchCounter + 1

if matchCounter > 0:
    with open('./lista.txt', 'a') as arq:
        for newCve in newCveList:
            arq.write(f'\n{newCve}')


if matchCounter == 0:
    print('Sem CVE Novo')